PE-Core L/14
About This Model
Overview
The Perception Encoder (PE) is a next-generation vision foundation model developed by Meta AI. It is part of a family of image and video encoders trained using a large-scale contrastive vision-language objective. Unlike earlier CLIP-style models, PE introduces improved training recipes, larger curated datasets, and alignment strategies that extract embeddings not only from the final layer but also from highly informative intermediate layers.
Architecture
The PE-Core L/14 variant used in this study is built on a Vision Transformer Large backbone (ViT-L, patch size 14) operating at a 336×336 input resolution. It produces 1,024-dimensional image embeddings designed to generalize broadly across classification, retrieval, and dense vision tasks.
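To make the interface concrete, the sketch below shows 336×336 preprocessing and extraction of an L2-normalized 1,024-dimensional embedding. The `StandInEncoder`, the resize/crop choices, and the normalization constants are illustrative placeholders, not the released PE-Core pipeline; swap in the official checkpoint and its transforms to reproduce the reported behavior.

```python
import torch
from torchvision import transforms
from PIL import Image

class StandInEncoder(torch.nn.Module):
    """Lightweight stand-in with PE-Core L/14's output width (1,024 dims)."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.pool = torch.nn.AdaptiveAvgPool2d(8)          # shrink spatial grid
        self.proj = torch.nn.Linear(3 * 8 * 8, dim)         # project to embedding

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.pool(x).flatten(1))

# 336x336 preprocessing; the normalization statistics here are assumptions,
# use whatever values ship with the official checkpoint.
preprocess = transforms.Compose([
    transforms.Resize(336, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(336),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])

model = StandInEncoder().eval()
image = Image.new("RGB", (640, 480))               # replace with a real product photo
batch = preprocess(image).unsqueeze(0)             # shape: (1, 3, 336, 336)

with torch.no_grad():
    embedding = torch.nn.functional.normalize(model(batch), dim=-1)
print(embedding.shape)                             # torch.Size([1, 1024])
```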
Training
The model is trained on billions of image-text pairs and further refined with video-based alignment, resulting in strong robustness to domain shift and high-quality global representations.
Evaluation Setup
Although the Perception Encoder is capable of multimodal and video-aware representation learning, in our evaluation we use it purely as an image-to-image retrieval encoder. This allows us to assess how well a modern, high-capacity vision-only foundation model performs in fine-grained industrial instance retrieval, where distinguishing between visually similar parts is critical.
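A minimal sketch of this retrieval protocol under simple assumptions: query and gallery embeddings are L2-normalized and the gallery is ranked by cosine similarity. The synthetic tensors and the `rank_gallery` helper are illustrative; in the actual evaluation the embeddings come from the encoder described above.

```python
import torch

def rank_gallery(query: torch.Tensor, gallery: torch.Tensor, k: int = 10) -> torch.Tensor:
    """Return the indices of the k most similar gallery embeddings per query.

    query:   (num_queries, dim) L2-normalized embeddings
    gallery: (num_gallery, dim) L2-normalized embeddings
    """
    sims = query @ gallery.T                       # cosine similarity for unit vectors
    return sims.topk(k, dim=1).indices             # (num_queries, k) ranked gallery indices

# Synthetic stand-in embeddings; real ones come from the PE encoder.
torch.manual_seed(0)
queries = torch.nn.functional.normalize(torch.randn(4, 1024), dim=-1)
gallery = torch.nn.functional.normalize(torch.randn(100, 1024), dim=-1)
top10 = rank_gallery(queries, gallery, k=10)       # rankings used for P@k, R@k, mAP@10
```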
Performance Across Datasets
| Dataset | Category | P@1 | P@5 | R@1 | R@5 | mAP@10 |
|---|---|---|---|---|---|---|
| VPRC 2023 | Mixed Retail | 32.97% | 15.19% | 21.90% | 45.15% | 36.49% |
| Intercars | Automotive | 18.82% | 16.59% | 6.34% | 20.62% | 21.38% |
| Stanford Online Products | E-commerce | 80.09% | 54.56% | 19.83% | 52.67% | 57.95% |
| IKEA | Furniture | 38.27% | 25.28% | 9.16% | 25.50% | 25.14% |
| Hornbach | Hardware/DIY | 25.20% | 9.78% | 25.20% | 48.88% | 33.80% |
| ARaymond | Industrial | 12.14% | 8.01% | 0.76% | 2.50% | 3.31% |
| Products-10K | E-commerce | 65.63% | 38.80% | 13.99% | 40.74% | 39.50% |
| TOPEX | Industrial | 69.87% | 65.48% | 2.18% | 10.23% | 54.74% |
| Average | | 42.87% | 29.21% | 12.42% | 30.79% | 34.04% |
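For reference, the sketch below shows how precision@k, recall@k, and mAP@k can be computed from ranked retrieval lists. The single-label relevance setup, the helper name, and the mAP truncation convention are assumptions; each benchmark's own definition of relevant gallery items applies, and queries present in the gallery would need to be excluded.

```python
import torch

def retrieval_metrics(ranked: torch.Tensor, query_labels: torch.Tensor,
                      gallery_labels: torch.Tensor, k: int) -> dict:
    """Compute P@k, R@k, and mAP@k from ranked gallery indices per query.

    ranked:         (num_queries, >=k) gallery indices sorted by similarity
    query_labels:   (num_queries,) instance id of each query
    gallery_labels: (num_gallery,) instance id of each gallery item
    """
    topk = ranked[:, :k]
    hits = gallery_labels[topk] == query_labels[:, None]          # (num_queries, k) bool

    # Precision@k: share of relevant items among the top-k retrieved.
    p_at_k = hits.float().mean(dim=1)

    # Recall@k: share of each query's relevant gallery items found in the top-k.
    num_relevant = (gallery_labels[None, :] == query_labels[:, None]).sum(dim=1).clamp(min=1)
    r_at_k = hits.sum(dim=1).float() / num_relevant

    # AP@k: precision averaged over the ranks where a relevant item appears;
    # clamping avoids division by zero when no relevant item is retrieved.
    ranks = torch.arange(1, k + 1, dtype=torch.float)
    precision_at_rank = hits.float().cumsum(dim=1) / ranks
    ap_at_k = (precision_at_rank * hits.float()).sum(dim=1) / hits.sum(dim=1).clamp(min=1)

    return {f"P@{k}": p_at_k.mean().item(),
            f"R@{k}": r_at_k.mean().item(),
            f"mAP@{k}": ap_at_k.mean().item()}

# Toy example; in the evaluation, `ranked` comes from the cosine-similarity ranking.
torch.manual_seed(0)
ranked = torch.stack([torch.randperm(100)[:10] for _ in range(4)])
query_labels = torch.randint(0, 20, (4,))
gallery_labels = torch.randint(0, 20, (100,))
print(retrieval_metrics(ranked, query_labels, gallery_labels, k=10))
```

The Average row of the table above is the unweighted mean of the per-dataset scores in each column.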