VideoVista-CulturalLingo

360° Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension

1Harbin Institute of Technology, Shenzhen
2Hong Kong University of Science and Technology

Introduction

Assessing video comprehension is an effective way to measure the understanding and reasoning abilities of multimodal AI systems. However, most video evaluation benchmarks are limited to a single language, typically English, and predominantly feature videos rooted in Western cultural contexts. We introduce VideoVista-CulturalLingo, the first video evaluation benchmark designed to bridge cultural, linguistic, and domain divides in video comprehension. Our work differs from existing benchmarks in three ways: (1) cultural diversity, incorporating cultures from China, North America, and Europe; (2) multilingual coverage, with questions presented in Chinese and English, two of the most widely spoken languages; (3) broad domain coverage, featuring videos sourced from hundreds of human-created domains. VideoVista-CulturalLingo contains 1,389 videos and 3,134 QA pairs, and we evaluate 24 recent open-source and proprietary large video models on it.

Leaderboard on VideoVista-CulturalLingo

Accuracy scores (%) on the VideoVista-CulturalLingo dataset.

Overall: average score across all tasks.

Event, Object, Culture, Science: scores on the Event, Object, Culture, and Science tasks, respectively.
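As a rough illustration of how such scores could be computed, the sketch below tallies per-task accuracy and an overall accuracy from a list of multiple-choice results. The record keys (`task`, `answer`, `prediction`) and the function name are hypothetical and not taken from the benchmark's code. The sketch also assumes that Overall pools all questions rather than averaging the four task scores; this is only a guess, prompted by the fact that the unweighted mean of the four task scores for Gemini-2.0-Flash (about 76.6) does not match its reported Overall of 76.3.

```python
from collections import defaultdict


def accuracy_report(records):
    """Return per-task and overall accuracy in percent.

    `records` is an iterable of dicts with hypothetical keys:
    'task' (e.g. "Event"), 'answer' (gold option letter) and
    'prediction' (the option letter chosen by the model).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["task"]] += 1
        # Compare option letters case-insensitively, ignoring stray spaces.
        if rec["prediction"].strip().upper() == rec["answer"].strip().upper():
            correct[rec["task"]] += 1

    report = {task: 100.0 * correct[task] / total[task] for task in total}
    # Assumption: "Overall" pools every question (micro-average), so tasks
    # with more questions weigh more than tasks with fewer.
    report["Overall"] = 100.0 * sum(correct.values()) / sum(total.values())
    return report


if __name__ == "__main__":
    demo = [
        {"task": "Event", "answer": "A", "prediction": "A"},
        {"task": "Event", "answer": "B", "prediction": "b"},
        {"task": "Event", "answer": "C", "prediction": "C"},
        {"task": "Culture", "answer": "D", "prediction": "A"},
    ]
    print(accuracy_report(demo))
```

With pooled scoring, a task that contributes more questions moves the Overall column more than a small task does, which is one way the Overall value can differ from a plain mean of the four task columns.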

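| # | Model | Type | LLM | Frames | Overall | Event | Object | Culture | Science |
|---|-------|------|-----|--------|---------|-------|--------|---------|---------|
| 1 | Gemini-2.0-Flash | Proprietary | Gemini-2.0-Flash | 1fps | 76.3 | 74.0 | 77.1 | 68.0 | 87.4 |
| 2 | Gemini-2.0-Flash-Lite | Proprietary | Gemini-2.0-Flash-Lite | 1fps | 70.7 | 63.1 | 71.6 | 63.1 | 82.1 |
| 3 | Gemini-1.5-Flash | Proprietary | Gemini-1.5-Flash | 1fps | 69.4 | 70.0 | 65.8 | 59.0 | 84.7 |
| 4 | Qwen2.5-VL-72B | Open-source | Qwen2.5-72B-Instruct | 1fps(300) | 61.3 | 61.0 | 40.5 | 71.2 | 83.3 |
| 5 | VideoLLaMA3 | Open-source | Qwen2.5-7B-Instruct | 1fps(180) | 60.7 | 58.0 | 66.4 | 53.1 | 64.4 |
| 6 | GPT-4o-2024-11-20 | Proprietary | GPT-4o | 1fps(128) | 56.7 | 53.4 | 38.2 | 68.0 | 78.3 |
| 7 | Qwen2.5-VL-7B | Open-source | Qwen2.5-7B-Instruct | 1fps(300) | 54.3 | 56.7 | 38.9 | 55.2 | 73.3 |
| 8 | InternVideo2.5 | Open-source | Internlm2.5-7b-Chat | 1fps(512) | 52.0 | 52.5 | 38.1 | 58.2 | 65.9 |
| 9 | InternVL2.5 | Open-source | Internlm2.5-7b-Chat | 64f | 52.0 | 56.5 | 35.5 | 56.1 | 65.7 |
| 10 | LLaVA-Video | Open-source | Qwen2-7B-Instruct | 1fps(64) | 51.0 | 57.9 | 39.1 | 48.8 | 60.3 |
| 11 | TPO | Open-source | Qwen2-7B-Instruct | 1fps(96) | 50.6 | 57.2 | 37.8 | 49.6 | 60.4 |
| 12 | mPLUG-Owl3 | Open-source | Qwen2-7B-Instruct | 1fps(128) | 49.9 | 54.4 | 41.9 | 45.0 | 60.1 |
| 13 | Qwen2-VL | Open-source | Qwen2-7B-Instruct | 1fps(300) | 49.7 | 50.1 | 33.8 | 54.8 | 68.0 |
| 14 | MiniCPM-o 2.6 | Open-source | Qwen2.5-7B-Instruct | 1fps(64) | 49.0 | 52.9 | 28.5 | 55.9 | 67.1 |
| 15 | MiniCPM-V 2.6 | Open-source | Qwen2-7B-Instruct | 1fps(64) | 42.9 | 44.1 | 24.1 | 49.4 | 62.9 |
| 16 | LLaVA-OneVision | Open-source | Qwen2-7B-Instruct | 32f | 41.8 | 43.9 | 33.8 | 38.8 | 53.5 |
| 17 | Oryx-1.5 | Open-source | Qwen2.5-7B-Instruct | 128f | 41.4 | 43.8 | 32.2 | 37.6 | 55.8 |
| 18 | Video-LLaVA | Open-source | Vicuna-7B-v1.5 | 8f | 38.2 | 42.2 | 34.4 | 34.5 | 41.1 |
| 19 | VideoLLaMA2 | Open-source | Mistral-7B-Instruct-v0.2 | 32f | 31.4 | 33.6 | 23.3 | 34.9 | 36.6 |
| 20 | VideoChat2-Mistral | Open-source | Mistral-7B-Instruct-v0.2 | 16f | 29.6 | 27.5 | 25.9 | 34.7 | 33.1 |
| 21 | ShareGPT4Video | Open-source | Vicuna-7B-v1.5 | 16f | 25.6 | 23.2 | 18.9 | 31.4 | 34.1 |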
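The Frames column appears to mix two sampling conventions. The sketch below shows one plausible reading: an entry such as 1fps(300) is taken to mean one frame per second with a cap of 300 frames, and an entry such as 64f is taken to mean 64 evenly spaced frames regardless of video length. This reading is an assumption and is not spelled out on this page; the function names and the fallback to even spacing when the cap is hit are hypothetical choices made only for illustration.

```python
def fps_frame_times(duration_s, fps=1.0, max_frames=None):
    """Timestamps (in seconds) for fps-based sampling with an optional cap.

    Assumption: when the fps-based count exceeds `max_frames`, the frames
    are spread evenly over the whole video instead of being truncated.
    """
    n = max(1, int(duration_s * fps))
    if max_frames is not None and n > max_frames:
        n = max_frames
    step = duration_s / n
    return [i * step for i in range(n)]


def uniform_frame_times(duration_s, num_frames):
    """Timestamps (in seconds) for a fixed number of evenly spaced frames."""
    step = duration_s / num_frames
    return [i * step for i in range(num_frames)]


if __name__ == "__main__":
    # A 10-minute video under a "1fps(300)"-style setting hits the cap.
    capped = fps_frame_times(600.0, fps=1.0, max_frames=300)
    print(len(capped))

    # The same video under a "64f"-style setting always yields 64 frames.
    fixed = uniform_frame_times(600.0, 64)
    print(len(fixed))
```

Under this reading, models with a low cap or a small fixed frame count see long videos much more sparsely than models sampled at an uncapped 1 fps, which is worth keeping in mind when comparing rows.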

Citation

@misc{chen2025videovistaculturallingo,
  title={VideoVista-CulturalLingo: 360$^\circ$ Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension},
  author={Xinyu Chen and Yunxin Li and Haoyuan Shi and Baotian Hu and Wenhan Luo and Yaowei Wang and Min Zhang},
  year={2025},
  eprint={2504.17821},
  archivePrefix={arXiv},
}