PE-Core L/14
About This Model
Overview
The Perception Encoder (PE) is a next-generation vision foundation model developed by Meta AI. It is part of a family of image and video encoders trained using a large-scale contrastive vision-language objective. Unlike earlier CLIP-style models, PE introduces improved training recipes, larger curated datasets, and alignment strategies that extract embeddings not only from the final layer but also from highly informative intermediate layers.
Architecture
The PE-Core L/14 variant used in this study is built on a Vision Transformer Large backbone (ViT-L, patch size 14) operating at a 336×336 input resolution. It produces 1,024-dimensional image embeddings designed to generalize broadly across classification, retrieval, and dense vision tasks.
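To make the interface concrete, the sketch below shows 336×336 preprocessing and extraction of an L2-normalized 1,024-dimensional embedding. The `StandInEncoder`, the resize/crop choices, and the normalization constants are illustrative placeholders, not the released PE-Core pipeline; swap in the official checkpoint and its transforms to reproduce the reported behavior.

```python
import torch
from torchvision import transforms
from PIL import Image

class StandInEncoder(torch.nn.Module):
    """Lightweight stand-in with PE-Core L/14's output width (1,024 dims)."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.pool = torch.nn.AdaptiveAvgPool2d(8)          # shrink spatial grid
        self.proj = torch.nn.Linear(3 * 8 * 8, dim)         # project to embedding

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.pool(x).flatten(1))

# 336x336 preprocessing; the normalization statistics here are assumptions,
# use whatever values ship with the official checkpoint.
preprocess = transforms.Compose([
    transforms.Resize(336, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(336),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])

model = StandInEncoder().eval()
image = Image.new("RGB", (640, 480))               # replace with a real product photo
batch = preprocess(image).unsqueeze(0)             # shape: (1, 3, 336, 336)

with torch.no_grad():
    embedding = torch.nn.functional.normalize(model(batch), dim=-1)
print(embedding.shape)                             # torch.Size([1, 1024])
```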
Training
The model is trained on billions of image-text pairs and further refined with video-based alignment, resulting in strong robustness to domain shift and high-quality global representations.
Evaluation Setup
Although the Perception Encoder is capable of multimodal and video-aware representation learning, in our evaluation we use it purely as an image-to-image retrieval encoder. This allows us to assess how well a modern, high-capacity vision-only foundation model performs in fine-grained industrial instance retrieval, where distinguishing between visually similar parts is critical.
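A minimal sketch of this retrieval protocol under simple assumptions: query and gallery embeddings are L2-normalized and the gallery is ranked by cosine similarity. The synthetic tensors and the `rank_gallery` helper are illustrative; in the actual evaluation the embeddings come from the encoder described above.

```python
import torch

def rank_gallery(query: torch.Tensor, gallery: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Return the indices of the k most similar gallery embeddings per query.

    query:   (num_queries, dim) L2-normalized embeddings
    gallery: (num_gallery, dim) L2-normalized embeddings
    """
    sims = query @ gallery.T                       # cosine similarity for unit vectors
    return sims.topk(k, dim=1).indices             # (num_queries, k) ranked gallery indices

# Synthetic stand-in embeddings; real ones come from the PE encoder.
torch.manual_seed(0)
queries = torch.nn.functional.normalize(torch.randn(4, 1024), dim=-1)
gallery = torch.nn.functional.normalize(torch.randn(100, 1024), dim=-1)
top10 = rank_gallery(queries, gallery, k=10)       # rankings used for P@k, R@k, mAP@10
```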
Performance Across Datasets
| Dataset | Category | P@1 | P@5 | R@1 | R@5 | mAP@10 |
|---|---|---|---|---|---|---|
| VPRC 2023 | Mixed Retail | 32.97% | 15.19% | 21.90% | 45.15% | 36.49% |
| Intercars | Automotive | 18.82% | 16.59% | 6.34% | 20.62% | 21.38% |
| Stanford Online Products | E-commerce | 80.09% | 54.56% | 19.83% | 52.67% | 57.95% |
| IKEA | Furniture | 38.27% | 25.28% | 9.16% | 25.50% | 25.14% |
| Hornbach | Hardware/DIY | 25.20% | 9.78% | 25.20% | 48.88% | 33.80% |
| ARaymond | Industrial | 12.14% | 8.01% | 0.76% | 2.50% | 3.31% |
| Products-10K | E-commerce | 65.63% | 38.80% | 13.99% | 40.74% | 39.50% |
| TOPEX | Industrial | 69.87% | 65.48% | 2.18% | 10.23% | 54.74% |
| Average | | 42.87% | 29.21% | 12.42% | 30.79% | 34.04% |
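For reference, the sketch below shows how precision@k, recall@k, and mAP@k can be computed from ranked retrieval lists. The single-label relevance setup, the helper name, and the mAP truncation convention are assumptions; each benchmark's own definition of relevant gallery items applies, and queries present in the gallery would need to be excluded.

```python
import torch

def retrieval_metrics(ranked: torch.Tensor, query_labels: torch.Tensor,
                      gallery_labels: torch.Tensor, k: int) -> dict:
    """Compute P@k, R@k, and mAP@k from ranked gallery indices per query.

    ranked:         (num_queries, >=k) gallery indices sorted by similarity
    query_labels:   (num_queries,) instance id of each query
    gallery_labels: (num_gallery,) instance id of each gallery item
    """
    topk = ranked[:, :k]
    hits = gallery_labels[topk] == query_labels[:, None]          # (num_queries, k) bool

    # Precision@k: share of relevant items among the top-k retrieved.
    p_at_k = hits.float().mean(dim=1)

    # Recall@k: share of each query's relevant gallery items found in the top-k.
    num_relevant = (gallery_labels[None, :] == query_labels[:, None]).sum(dim=1).clamp(min=1)
    r_at_k = hits.sum(dim=1).float() / num_relevant

    # AP@k: precision averaged over the ranks where a relevant item appears;
    # clamping avoids division by zero when no relevant item is retrieved.
    ranks = torch.arange(1, k + 1, dtype=torch.float)
    precision_at_rank = hits.float().cumsum(dim=1) / ranks
    ap_at_k = (precision_at_rank * hits.float()).sum(dim=1) / hits.sum(dim=1).clamp(min=1)

    return {f"P@{k}": p_at_k.mean().item(),
            f"R@{k}": r_at_k.mean().item(),
            f"mAP@{k}": ap_at_k.mean().item()}

# Toy example; in the evaluation, `ranked` comes from the cosine-similarity ranking.
torch.manual_seed(0)
ranked = torch.stack([torch.randperm(100)[:10] for _ in range(4)])
query_labels = torch.randint(0, 20, (4,))
gallery_labels = torch.randint(0, 20, (100,))
print(retrieval_metrics(ranked, query_labels, gallery_labels, k=10))
```

The Average row of the table above is the unweighted mean of the per-dataset scores in each column.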