
MIRACL-VISION: A Large, multilingual, visual document retrieval benchmark
Table 5: NDCG@10 of text embedding models and visual embedding models on MIRACL-VISION.

| Language | multilingual-e5-large | arctic-embed-l-v2.0 | gte-multilingual-base | bge-m3 | Llama-3.2-1B (internal) | dse-qwen2-2b-mrl-v1 | gme-Qwen2-VL-2B-Instruct | vdr-2b-multi-v1 | colqwen2-v1.0 |
|---|---|---|---|---|---|---|---|---|---|
| MIRACL-VISION variant | Text | Text | Text | Text | Text | Image | Image | Image | Image |
| # Params (in M) | 560 | 567 | 305 | 567 | 1235 | 1543 | 1543 | 1543 | 1543 |
| Arabic | 0.8557 | 0.8754 | 0.8503 | 0.8883 | 0.8833 | 0.3893 | 0.4888 | 0.4379 | 0.4129 |
| Bengali | 0.8421 | 0.8325 | 0.8211 | 0.8585 | 0.7902 | 0.2352 | 0.3755 | 0.2473 | 0.2888 |
| Chinese | 0.6900 | 0.7179 | 0.7167 | 0.7458 | 0.7561 | 0.5962 | 0.6314 | 0.5963 | 0.4926 |
| English | 0.7029 | 0.7437 | 0.7345 | 0.7348 | 0.7721 | 0.6605 | 0.6784 | 0.6784 | 0.6417 |
| Farsi | 0.6793 | 0.7001 | 0.6984 | 0.7297 | 0.7192 | 0.2250 | 0.3085 | 0.2398 | 0.2616 |
| Finnish | 0.8974 | 0.9014 | 0.8957 | 0.9071 | 0.9097 | 0.4162 | 0.6863 | 0.5283 | 0.6604 |
| French | 0.7208 | 0.8236 | 0.7771 | 0.8158 | 0.8545 | 0.7160 | 0.6851 | 0.7194 | 0.6876 |
| German | 0.7622 | 0.7774 | 0.7498 | 0.7695 | 0.7823 | 0.6267 | 0.6345 | 0.6205 | 0.5995 |
| Hindi | 0.7595 | 0.7255 | 0.6916 | 0.7581 | 0.7770 | 0.1740 | 0.3127 | 0.2058 | 0.2209 |
| Indonesian | 0.6793 | 0.6906 | 0.6757 | 0.7049 | 0.6977 | 0.4866 | 0.5416 | 0.5254 | 0.5320 |
| Japanese | 0.8378 | 0.8484 | 0.8442 | 0.8720 | 0.8802 | 0.6232 | 0.7305 | 0.6553 | 0.6970 |
| Korean | 0.7327 | 0.7545 | 0.7397 | 0.7934 | 0.8088 | 0.4446 | 0.6202 | 0.4952 | 0.4419 |
| Russian | 0.7857 | 0.8242 | 0.8023 | 0.8363 | 0.8468 | 0.6505 | 0.7202 | 0.6995 | 0.6811 |
| Spanish | 0.6596 | 0.7250 | 0.7029 | 0.7268 | 0.7318 | 0.5927 | 0.6277 | 0.6274 | 0.6224 |
| Swahili | 0.8157 | 0.8089 | 0.7987 | 0.8337 | 0.8059 | 0.4156 | 0.5348 | 0.4509 | 0.4931 |
| Telugu | 0.8948 | 0.9201 | 0.9076 | 0.9090 | 0.8101 | 0.0274 | 0.0893 | 0.0318 | 0.0264 |
| Thai | 0.8424 | 0.8485 | 0.8509 | 0.8682 | 0.8673 | 0.2692 | 0.3563 | 0.3177 | 0.2389 |
| Yoruba | 0.5655 | 0.5332 | 0.5698 | 0.5842 | 0.5839 | 0.4178 | 0.4884 | 0.4577 | 0.5120 |
| Average | 0.7624 | 0.7806 | 0.7682 | 0.7964 | 0.7932 | 0.4426 | 0.5283 | 0.4741 | 0.4728 |
| Average w/o Telugu | 0.7546 | 0.7724 | 0.7600 | 0.7898 | 0.7922 | 0.4670 | 0.5542 | 0.5002 | 0.4991 |

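
For readers unfamiliar with the metric in Table 5, the following is a minimal sketch of how NDCG@10 can be computed for a single query. It is not the evaluation code used for the benchmark: the function names, the document identifiers, and the choice of linear gain are assumptions made for illustration. With binary relevance labels, linear and exponential gain give the same value.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k gains (rank 1 first)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains[:k], start=1))


def ndcg_at_k(ranked_doc_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """NDCG@k for one query.

    ranked_doc_ids: document ids returned by the retriever, best first.
    relevance: hypothetical judgments mapping doc id -> gain (unjudged docs count as 0).
    """
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_doc_ids]
    # The ideal ranking places all judged documents in decreasing order of gain.
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    if ideal_dcg == 0.0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(gains, k) / ideal_dcg


# Hypothetical query: two relevant pages, one of them retrieved at rank 2.
judgments = {"page_1": 1.0, "page_2": 1.0}
score = ndcg_at_k(["page_3", "page_1", "page_7"], judgments, k=10)
print(f"NDCG@10 = {score:.4f}")
```

A per-language score such as those in the table would then be the mean of this value over all queries in that language.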
In the future, we plan to provide a MIRACL-VISION train split
and fine-tune visual embedding models on it. We also suggest
enriching MIRACL-VISION with more modalities in multiple
languages for multimodal multilingual evaluation.