Vertex AI Multi-Modal
- Model Type: Generic
- Overall Rank: #3
- Avg P@1: 42.8%
- Avg mAP@10: 34.8%
- Embedding Dimension: 1408
- Input Resolution: N/A
- Datasets: 8
About This Model
Overview
Google's Vertex AI Multimodal Embeddings model is a foundation embedding model that projects text, images, and video into a shared semantic space. The model exposes a multimodalembedding@001 endpoint that outputs 1,408-dimensional vectors for all supported modalities.
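As a rough illustration of how an image embedding could be requested from this model, the sketch below uses the Vertex AI Python SDK (`vertexai.vision_models`). The helper name `embed_image`, the `us-central1` region default, and the project ID are assumptions made for the example, not details from this page; running it requires a Google Cloud project with Vertex AI enabled and valid credentials.

```python
def embed_image(path, project, location="us-central1", dimension=1408):
    """Return the embedding of one local image file as a list of floats.

    `project` and `location` are placeholders (assumptions); substitute the
    values for your own Google Cloud setup.
    """
    # Imported lazily so this module can be loaded without the SDK installed.
    import vertexai
    from vertexai.vision_models import Image, MultiModalEmbeddingModel

    vertexai.init(project=project, location=location)
    model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")

    # Only the image pathway is used here; no contextual text is supplied.
    response = model.get_embeddings(
        image=Image.load_from_file(path),
        dimension=dimension,
    )
    return list(response.image_embedding)


# Hypothetical usage (needs credentials and network access):
# vector = embed_image("part.jpg", project="my-gcp-project")
# len(vector) would match the requested dimension
```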
Capabilities
The embeddings are designed for tasks such as:
- Semantic search
- Recommendation
- Content moderation
- Classification
- Similarity-based retrieval across modalities
Both image and text embeddings share the same dimensionality and space, enabling cross-modal queries (e.g. text-to-image retrieval).
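Because both modalities live in one space, a cross-modal query reduces to a nearest-neighbor search between vectors. The sketch below ranks a small image gallery against a text query using cosine similarity, which is a common choice but an assumption here, since this page does not name the similarity measure. The 3-dimensional vectors and file names are toy values for illustration; real embeddings would have 1,408 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, gallery):
    """Return gallery IDs ordered from most to least similar to the query.

    `gallery` maps an item ID to its embedding vector.
    """
    return sorted(
        gallery,
        key=lambda item_id: cosine_similarity(query, gallery[item_id]),
        reverse=True,
    )


# Toy stand-ins for a text embedding and three image embeddings.
text_query = [0.9, 0.1, 0.0]
image_gallery = {
    "chair.jpg": [0.8, 0.2, 0.1],
    "drill.jpg": [0.0, 0.1, 0.9],
    "sofa.jpg": [0.6, 0.6, 0.0],
}

ranking = rank_by_similarity(text_query, image_gallery)
print(ranking)  # ['chair.jpg', 'sofa.jpg', 'drill.jpg']
```

The same function covers image-to-image retrieval: only the source of the query vector changes.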
Evaluation Setup
In our study, we use only the image embedding pathway and evaluate the model in a pure image-to-image retrieval setting, to understand how a general multimodal model behaves on industrial instance-level search tasks.
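For readers unfamiliar with the reported metrics, the sketch below shows one common way to compute P@k, R@k, and AP@10 for a single query, given a ranked list of retrieved IDs and the set of relevant IDs. The exact definitions used in the study are not stated in this section, so the normalization of AP@10 in particular (dividing by the smaller of the number of relevant items and k) is an assumption. mAP@10 would then be the mean of AP@10 over all queries.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for item in ranked_ids[:k] if item in relevant_ids)
    return hits / len(relevant_ids)


def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """Average of the precision values at each rank where a relevant item
    occurs, normalized by min(number of relevant items, k) (assumed)."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids[:k], start=1):
        if item in relevant_ids:
            hits += 1
            precision_sum += hits / rank
    denominator = min(len(relevant_ids), k)
    return precision_sum / denominator if denominator else 0.0


# Toy query: five retrieved items, four relevant items in the gallery.
ranked = ["a", "x", "b", "y", "c"]
relevant = {"a", "b", "c", "d"}

p1 = precision_at_k(ranked, relevant, 1)          # 1.0
p5 = precision_at_k(ranked, relevant, 5)          # 0.6
r5 = recall_at_k(ranked, relevant, 5)             # 0.75
ap10 = average_precision_at_k(ranked, relevant)   # (1 + 2/3 + 3/5) / 4
```

Under these definitions, a dataset with many relevant items per query can show high P@1 alongside low R@1, which is one possible reading of rows where the two differ sharply.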
Performance Across Datasets
| Dataset | Category | P@1 | P@5 | R@1 | R@5 | mAP@10 |
|---|---|---|---|---|---|---|
| VPRC 2023 | Mixed Retail | 29.66% | 14.42% | 19.76% | 42.64% | 33.57% |
| Intercars | Automotive | 19.69% | 18.13% | 6.38% | 21.37% | 21.84% |
| Stanford Online Products | E-commerce | 76.88% | 51.66% | 19.05% | 50.09% | 54.68% |
| IKEA | Furniture | 52.29% | 33.26% | 15.04% | 36.71% | 36.93% |
| Hornbach | Hardware/DIY | 24.46% | 9.51% | 24.46% | 47.54% | 34.07% |
| ARaymond | Industrial | 8.77% | 6.17% | 0.55% | 1.93% | 2.47% |
| Products-10K | E-commerce | 63.29% | 39.70% | 13.50% | 41.66% | 40.88% |
| TOPEX | Industrial | 67.47% | 64.55% | 2.11% | 10.09% | 54.03% |
| Average | | 42.81% | 29.68% | 12.61% | 31.50% | 34.81% |
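The Average row is consistent with an unweighted mean over the eight datasets, so each dataset counts equally regardless of its size. The short check below recomputes two of the columns from the table values; the helper name `macro_average` is chosen for this example only.

```python
# Per-dataset values copied from the table, in row order.
p_at_1 = [29.66, 19.69, 76.88, 52.29, 24.46, 8.77, 63.29, 67.47]
map_at_10 = [33.57, 21.84, 54.68, 36.93, 34.07, 2.47, 40.88, 54.03]


def macro_average(values):
    """Unweighted mean: every dataset contributes equally."""
    return sum(values) / len(values)


avg_p1 = macro_average(p_at_1)
avg_map = macro_average(map_at_10)
print(f"{avg_p1:.2f}% {avg_map:.2f}%")  # 42.81% 34.81%
```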