Visual Product Search
Benchmark Report

A comprehensive evaluation of state-of-the-art visual embedding models for product search and retrieval applications.

Introduction

Embedding models have become a cornerstone of modern information retrieval systems. By converting data like images, text, or audio into vector representations, these models enable fast and scalable similarity search across large datasets. They now power applications ranging from search and recommendation to Retrieval-Augmented Generation (RAG), enterprise knowledge systems, and emerging agentic workflows.

In computer vision, embedding models, often trained as general-purpose representation models, serve as the backbone for downstream tasks such as image classification, object detection, and generative models like Stable Diffusion. Early foundation models such as CLIP, ALIGN, and CoCa, trained on web-scale image-text pairs, demonstrated strong zero-shot generalization across a wide range of visual and multimodal tasks. More recently, Vision-Language Models (VLMs) have been adapted into unified representation models such as VLM2Vec, Jina V4, and Nomic Multimodal Embed. These models are appealing because they promise broad generalization without the need for task-specific fine-tuning.

Motivation

At nyris, visual product search is formulated as an instance retrieval problem: given a query image, the objective is to retrieve visually identical or near-identical product instances from a catalog containing thousands to millions of items. Our customers, primarily in manufacturing, industrial automation, and the automotive sector, operate in product ecosystems where hundreds to thousands of visually similar items coexist, often differing only in subtle shape features, fine-grained textures, or minute variations introduced during production, and photographed under heterogeneous imaging conditions.

However, the assessment of foundation models on instance-level image retrieval remains limited due to shortcomings in existing benchmark datasets. Instance-level image retrieval is the task of retrieving the exact same physical object from a large collection of visually similar objects. Most public benchmarks focus on broader category-level or landmark-level retrieval, such as ROxford/RParis, GLDv2, or SOP, where objects are visually distinct and fine-grained instance discrimination is not central. The complexity and scale of existing datasets do not meet the demands of modern applications built on instance-level image retrieval, such as product identification, visual search, and duplicate detection.

Objective

This study aims to provide transparent insights into the retrieval performance of current foundation embedding models, highlighting where they excel, where they fall short, and what this means for practical deployment in manufacturing and automotive environments. Our specific objectives are:

  • Provide a fair comparison of embedding models on product search tasks
  • Evaluate performance across diverse product categories and domains
  • Identify strengths and weaknesses of different modeling approaches
  • Guide practitioners in selecting models for their use cases

It is important to emphasize that the foundation models included in this evaluation were designed for broad, multimodal applications such as semantic search, vision-language alignment, and RAG pipelines. They might not be explicitly optimized for ultra-fine-grained instance discrimination. Our comparison therefore focuses on how well these models perform without any domain-specific adaptation.

Scope

This study focuses on visual-only retrieval using image embeddings. We evaluate models on their ability to retrieve the correct product given a query image, measuring Precision, Recall, and mAP metrics at various cutoffs.

In Scope

  • Image-to-image retrieval
  • Pre-trained embedding models (no fine-tuning)
  • Product domain datasets from industry and academia
  • Precision@k, Recall@k, and mAP@k metrics

Out of Scope

  • Text-to-image retrieval
  • Fine-tuning comparisons
  • Latency and throughput benchmarks
  • Cost analysis

The recently introduced ILIAS benchmark is one of the first efforts to evaluate modern embedding models on instance-level retrieval at scale. While valuable, it focuses on consumer objects and web imagery, and therefore does not capture the extreme intra-class similarity, domain shifts, and imaging conditions typical in industrial settings.

Stanford Online Products (SOP) remains a widely used benchmark for deep metric learning. Products-10K extends this with larger scale and more challenging real-world conditions. However, both datasets predate the current generation of foundation models and may not fully stress-test their capabilities on fine-grained industrial products.

This study complements existing benchmarks by introducing evaluation on real-world industrial datasets from manufacturing, automotive, and retail domains, where the visual similarity between products is substantially higher than in consumer-focused benchmarks.

Datasets

Our evaluation spans 8 diverse datasets drawn from industry partners, academic benchmarks, and competitions:

Industry Datasets

  • Intercars: Automotive spare parts from a leading European distributor (17,965 products)
  • IKEA: Home furniture and accessories catalog (50,641 products)
  • Hornbach: DIY and home improvement products (127,597 products)
  • ARaymond: Industrial fastening and assembly solutions (12,531 products)

Academic Datasets

  • Stanford Online Products: Classic e-commerce benchmark with 11,316 products across 12 categories
  • Products-10K: Large-scale product recognition dataset with 10,000+ products
  • ILIAS: Large-scale instance retrieval benchmark with 5M+ reference images (CVPR 2025)

Competition Datasets

  • VPRC 2023: Visual Product Recognition Challenge featuring diverse retail products

The datasets employ two complementary evaluation protocols: query-reference retrieval, with separate query and catalog sets (the inter-retrieval setting described below), and self-retrieval, where the same images serve as both queries and references (the intra-retrieval setting).

Models

We evaluate 10 embedding models from 6 providers, spanning proprietary APIs, open-source foundation models, and domain-specialized systems:

Proprietary Models

  • nyris General V5.1: Broad-coverage visual embedding model optimized for industrial product identification
  • nyris Automotive V1: Domain-specialized model for automotive part retrieval
  • Google Vertex AI Multi-Modal: Foundation model for cross-modal retrieval via API
  • Cohere Embed V4: Enterprise multimodal embedding API

Open Source Foundation Models

  • Meta DINOv2 Large: Self-supervised vision transformer with strong visual features
  • Meta DINOv3 ViT-L/16: Latest iteration with improved fine-grained features
  • Meta PE-Core L/14: Vision model trained with contrastive vision-language objective
  • Google SigLIP2 SO400M: Shape-optimized architecture with sigmoid contrastive loss
  • Jina Embeddings V4: Multimodal embedding model optimized for search
  • Nomic Embed MM 3B: 3B parameter multimodal embedding model

Models vary in embedding dimensionality (768 to 1408), input resolution (224 to 384), and modality support (vision-only vs. multimodal).
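Because dimensionality and score scales differ across models, similarity scores are only comparable within a single model. The sketch below, using random placeholder vectors rather than real model outputs, illustrates the common practice of L2-normalizing embeddings so that inner products correspond to cosine similarity regardless of a model's native dimensionality; it is a minimal illustration, not any provider's official pipeline.

```python
import numpy as np

def l2_normalize(embeddings: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Row-wise L2 normalization of an (n, d) embedding matrix."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, eps)

# Random placeholder vectors stand in for real model outputs.
query_emb = l2_normalize(np.random.randn(4, 1024))
gallery_emb = l2_normalize(np.random.randn(100, 1024))
similarity = query_emb @ gallery_emb.T  # (4, 100) cosine similarities
```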

Evaluation Protocol

Visual product search aims to rank the correct product as early as possible for a given query image, ideally at the first position. In practical systems, this must be achieved under real-world conditions that include visual ambiguity, incomplete product catalogs, and noise from uncontrolled image capture.

For these reasons, evaluation focuses on ranking quality rather than binary correctness, following established practices in information retrieval [1].

While immediate identification is the primary goal, presenting the top-5 to top-10 results increases the likelihood that the correct product appears within a reasonable ranking window. These ranked candidates are further consumed by post-processing steps such as re-ranking and business rules, making ranking quality a system-level concern. In this report, we do not apply any re-ranking and focus solely on the retrieval performance of the embedding models.

The evaluation protocol is designed to answer three questions in ranked retrieval:

  • Does the system place the correct product at rank one?
  • If not, does it appear within the top-K results?
  • Does the retrieval stage provide a reliable candidate set for downstream processing?

Each evaluation query consists of a real-world product image. The gallery is a large product catalog that may contain multiple images of the same product instance as well as visually similar but distinct products. All evaluation in this setup is performed in a closed-set setting, meaning that the product depicted in each query is guaranteed to be present in the target database.

For each query, the system produces a ranked list of candidates ordered by similarity score. Evaluation is performed on this ranking, not on a binary accept or reject decision. The position of correct matches in the ranked list is therefore central to all reported metrics.
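As a minimal illustration of this step, the sketch below (all variable names are hypothetical) turns a query-gallery similarity matrix into the ranked candidate lists on which every metric in this report is computed.

```python
import numpy as np

def rank_gallery(similarity: np.ndarray) -> np.ndarray:
    """Return gallery indices sorted by descending similarity, one row per query."""
    return np.argsort(-similarity, axis=1)

# Example: 2 queries against a gallery of 5 items.
sim = np.array([[0.2, 0.9, 0.1, 0.5, 0.4],
                [0.7, 0.3, 0.8, 0.2, 0.6]])
ranked = rank_gallery(sim)  # [[1, 3, 4, 0, 2], [2, 0, 4, 1, 3]]
```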

Dataset Setup

Evaluation has been conducted using two complementary dataset protocols, depending on data availability and the aspect of the system being validated. Both protocols are widely used in image retrieval literature and benchmarking practice [2].

Inter-Retrieval Evaluation

In the inter-retrieval setting, the dataset is explicitly split into a query set and a reference or gallery set. Queries do not appear verbatim in the reference split. Ground-truth relevance is defined through known instance-level associations between query and reference images, following standard retrieval benchmark design.

This setup closely reflects real-world deployment scenarios, where user queries originate outside the catalog. It provides a clean and unambiguous evaluation of retrieval and ranking performance and is the preferred protocol when such splits are available.
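A minimal sketch of how ground-truth relevance can be derived in this setting, assuming instance-level product labels are available for both splits (the label arrays below are hypothetical):

```python
import numpy as np

def relevance_matrix(query_labels: np.ndarray, gallery_labels: np.ndarray) -> np.ndarray:
    """Boolean (num_queries, num_gallery) matrix: True where a gallery image
    depicts the same product instance as the query."""
    return query_labels[:, None] == gallery_labels[None, :]

query_labels = np.array([101, 205])
gallery_labels = np.array([205, 101, 333, 101, 205])
rel = relevance_matrix(query_labels, gallery_labels)
# rel[0] -> [False,  True, False,  True, False]
```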

Intra-Retrieval Evaluation

In the intra-retrieval setting, the dataset consists of a single split. Each sample is treated as a query and matched against the entire dataset, including itself. To avoid trivial self-matches, the top-ranked result corresponding to the query itself is removed before evaluation, a common practice in this kind of evaluation.

This protocol is commonly used when explicit query–reference splits are unavailable, but results should be interpreted with care, as this setup can overestimate performance relative to deployment conditions.
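The sketch below illustrates this protocol under the same assumptions as the previous snippets: every image queries the full set, and the trivial self-match is removed before scoring.

```python
import numpy as np

def self_retrieval_ranking(embeddings: np.ndarray) -> np.ndarray:
    """Rank every item against the full set and drop each query's self-match."""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = emb @ emb.T
    np.fill_diagonal(sim, -np.inf)           # exclude the query itself
    return np.argsort(-sim, axis=1)[:, :-1]  # the self-match ends up last; drop it

ranked = self_retrieval_ranking(np.random.randn(10, 128))  # shape (10, 9)
```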

Metrics

Visual product search is fundamentally a ranking problem, as emphasized in information retrieval literature [1]. The system must not only retrieve the correct product, but rank it as early as possible, ideally at the first position. As a result, the evaluation protocol prioritizes metrics that are sensitive to the ordering of results in the ranked list.

Rank-aware metrics explicitly account for the position at which relevant items appear. They reward systems that place correct matches early and penalize systems that retrieve correct results only at lower ranks. Precision@K, Recall@K, Average Precision, and Mean Average Precision all belong to this category and are standard metrics in ranked retrieval evaluation [1].

Rank-unaware metrics evaluate retrieval outcomes without considering order. Metrics such as accuracy or set-based precision and recall only measure whether a prediction is correct, not where it appears in the ranking. In large-scale retrieval systems, these metrics can obscure critical differences in ranking quality and are therefore not suitable for evaluating visual product search [1].

For this reason, all metrics used in this protocol are rank-aware and directly reflect ranking quality.

Precision

Precision measures how many of the retrieved results are correct. In the retrieval context, Precision@K is defined as the fraction of relevant items among the top-K ranked results [1].

$$\text{Precision@}K = \frac{\left|\{\text{relevant items in the top } K \text{ results}\}\right|}{K}$$

Precision is closely aligned with user perception. Precision@1 reflects whether the system identifies the product immediately, which strongly influences user trust. Precision at higher cutoffs such as 5 or 10 provides insight into the cleanliness and stability of the ranking under visual ambiguity.

Precision alone does not capture whether relevant products were missed entirely and must therefore be interpreted together with recall [1].
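A minimal sketch of Precision@K for a single query, following the definition above; `ranked_relevance` is a hypothetical boolean array marking which ranked results are correct matches.

```python
import numpy as np

def precision_at_k(ranked_relevance: np.ndarray, k: int) -> float:
    """Fraction of relevant items among the top-K ranked results."""
    return float(np.sum(ranked_relevance[:k])) / k

ranked_relevance = np.array([True, False, True, False, False])
precision_at_k(ranked_relevance, 1)  # 1.0
precision_at_k(ranked_relevance, 5)  # 0.4
```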

Recall

Recall measures how many of the relevant products are successfully retrieved within the top-K results. Recall@K is defined as the fraction of all relevant items that appear within the top-K ranked list [1].

$$\text{Recall@}K = \frac{\left|\{\text{relevant items in the top } K \text{ results}\}\right|}{\left|\{\text{all relevant items for the query}\}\right|}$$

Recall is critical for system robustness. If recall is low, the correct product never enters the candidate set and cannot be recovered by re-ranking or post-processing. Such failures are often silent and difficult to detect without careful evaluation.

High recall does not guarantee good user experience, as relevant products may still be ranked too low. Recall should therefore be used to assess coverage rather than ranking quality.
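A minimal sketch of Recall@K for a single query, using the same hypothetical `ranked_relevance` array as above:

```python
import numpy as np

def recall_at_k(ranked_relevance: np.ndarray, k: int, num_relevant: int) -> float:
    """Fraction of all relevant items that appear within the top-K results."""
    if num_relevant == 0:
        return 0.0
    return float(np.sum(ranked_relevance[:k])) / num_relevant

ranked_relevance = np.array([True, False, True, False, False])
recall_at_k(ranked_relevance, 5, num_relevant=3)  # 2/3 ≈ 0.667
```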

Mean Average Precision (mAP)

Mean Average Precision is a rank-aware metric that summarizes retrieval quality across all recall levels and all queries. It is computed by first calculating the Average Precision for each query, which reflects how early and consistently relevant items appear in the ranking, and then averaging this value across the full query set [1].

Average Precision (AP) for a single query is computed as the sum of precision values at each relevant item position, divided by the total number of relevant items:

$$\text{AP} = \frac{1}{R} \sum_{k=1}^{N} P(k) \cdot \text{rel}(k)$$

where R = total number of relevant items, N = ranking depth, P(k) = Precision@k, and rel(k) = 1 if the item at rank k is relevant, 0 otherwise.

Mean Average Precision (mAP) is then computed by averaging AP across all queries:

$$\text{mAP} = \frac{1}{Q} \sum_{q=1}^{Q} \text{AP}(q)$$

where Q = total number of queries, and AP(q) = Average Precision for query q.
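A minimal sketch implementing these two definitions directly; the two example queries at the bottom are hypothetical.

```python
import numpy as np

def average_precision(ranked_relevance: np.ndarray, num_relevant: int) -> float:
    """AP: sum of Precision@k at each relevant rank, divided by R."""
    if num_relevant == 0:
        return 0.0
    hits, ap = 0, 0.0
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            ap += hits / k  # Precision@k at this relevant position
    return ap / num_relevant

def mean_average_precision(all_rankings, all_num_relevant) -> float:
    """mAP: mean of AP over all queries."""
    aps = [average_precision(r, n) for r, n in zip(all_rankings, all_num_relevant)]
    return float(np.mean(aps))

# Example with two hypothetical queries.
q1 = np.array([True, False, True])   # R = 2 -> AP = (1/1 + 2/3) / 2 ≈ 0.833
q2 = np.array([False, True, False])  # R = 1 -> AP = 1/2  = 0.5
mean_average_precision([q1, q2], [2, 1])  # ≈ 0.667
```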

mAP is widely used in academic benchmarks and industrial retrieval systems, including standard image retrieval evaluations [2]. It provides a stable and reproducible basis for comparing models and captures performance on both easy and hard queries. While less intuitive than point metrics such as Precision@1, mAP offers the most reliable signal for offline model selection and long-term system improvement.

Conclusion

This study represents an initial but meaningful step toward establishing a more realistic understanding of vector-based visual retrieval performance beyond the consumer-focused datasets that dominate public benchmarks. By evaluating foundation models on real-world industrial datasets, we provide practitioners with actionable insights for model selection in manufacturing, automotive, and retail applications.

Key takeaways:

  • Foundation models show varying performance across domains, with no single model excelling everywhere
  • Domain-specialized models can significantly outperform general-purpose embeddings on specific verticals
  • Industrial datasets with high intra-class similarity remain challenging for current foundation models
  • The gap between academic benchmarks and real-world deployment scenarios is substantial

Each metric serves a distinct role in evaluation:

  • Precision@1 reflects immediate user-facing success and should be treated as a product-critical indicator.
  • Precision@K (K>1) captures ranking stability under uncertainty at higher cutoffs.
  • Recall@K ensures that the retrieval stage supplies sufficient candidates for downstream processing.
  • Mean Average Precision captures overall ranking quality and should be the primary metric for comparing retrieval models during development.

References

  1. Manning, C. D., Raghavan, P., and Schütze, H. Introduction to Information Retrieval. Cambridge University Press, 2008.
  2. Radenović, F., Tolias, G., and Chum, O. Revisiting Oxford and Paris: Large-Scale Image Retrieval Benchmarking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

Future Work

We plan to expand this benchmark in several directions:

  • Additional models: Include emerging foundation models as they are released
  • Extended datasets: Add more industry verticals and larger-scale evaluations
  • Fine-tuning analysis: Evaluate the impact of domain-specific fine-tuning
  • Efficiency metrics: Add latency, throughput, and cost comparisons
  • Failure analysis: Detailed investigation of challenging cases and error patterns

We welcome contributions and feedback from the research community and industry practitioners.