- [2025-06-10] A new version of InfiniBench has been released.
The InfiniBench skill set, comprising eight skills. The right side shows the skill categories and question types, while the left side provides examples of both multiple-choice (MCQ) and open-ended questions.
Understanding long-form videos, such as movies and TV episodes ranging from tens of minutes to two hours, remains a major challenge for multi-modal models. Existing benchmarks often fall short in testing the full range of cognitive skills needed to process these temporally rich and narratively complex inputs. We introduce InfiniBench, a comprehensive benchmark designed to rigorously evaluate the capabilities of models in long video understanding. InfiniBench offers: (1) over 1,000 hours of video content, with an average video length of 52.59 minutes; (2) the largest set of question-answer pairs for long video comprehension, totaling around 91K; (3) eight diverse skills that span both grounding-based (e.g., scene transitions, character actions) and reasoning-based (e.g., deep context, multi-event linking) understanding; and (4) rich annotation formats, including both multiple-choice and open-ended questions. We conduct an in-depth evaluation across both commercial (GPT-4o, Gemini 1.5 Flash) and open-source (Qwen2.5-VL, InternVL2.5) vision-language models. Results reveal that current models remain far from solving long video understanding: on grounding-based skills, the top open-source model (Qwen2.5-VL) and GPT-4o achieve only 39.4% and 48.1% accuracy, respectively. Interestingly, several models achieve non-trivial performance using only the movie or episode title, without watching the video, revealing a reliance on pre-trained world knowledge that partially compensates for the absence of visual or temporal understanding. These findings highlight critical gaps in current approaches and underscore the need for models that truly engage with long visual narratives.
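For concreteness, each benchmark item pairs a long video with either an MCQ or an open-ended question under one of the eight skills. The snippet below is a hypothetical illustration of what such items look like; the field names and paths are ours, not the exact released schema:

```python
# Hypothetical InfiniBench-style items (illustrative only; consult the
# released annotation files for the exact schema and field names).
mcq_item = {
    "video": "tvqa/friends_s01e01.mp4",    # placeholder path
    "skill": "scene_transitions",          # one of the 8 skills
    "question": "Which location does the story cut to after the cafe scene?",
    "options": ["A) The apartment", "B) The office",
                "C) The street", "D) The museum"],
    "answer": "A",
}
open_ended_item = {
    "video": "movienet/tt0111161.mp4",     # placeholder path
    "skill": "deep_context_understanding",
    "question": "Why does the protagonist conceal the letter from his friend?",
    "answer": "Free-form reference answer, graded on the 0-10 scale "
              "reported in the leaderboard.",
}
```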
Models | Frame Rate | Global Appearance | Scene Transitions | Character Actions | Chronological Understanding | Summarization | Deep Context Understanding | Spoiler Understanding | Linking Events | Avg. Acc (0-100) | Avg. Score (0-10)
---|---|---|---|---|---|---|---|---|---|---|---
Baseline Random | -- | 19.96 | 19.77 | 18.41 | 36.45 | -- | -- | -- | -- | 23.65 | -- |
GPT-4o | 450 | 54.82 | 43.76 | 45.29 | 66.24 | 6.35 | 6.92 | 4.01 | 6.72 | 52.53 | 6.00 |
Gemini Flash 2.0 | 1 FPS | 49.06 | 45.14 | 57.67 | 55.80 | 5.81 | 6.27 | 3.97 | 6.38 | 51.92 | 5.61 |
Qwen2.5-VL | 768 | 33.16 | 29.85 | 29.31 | 45.37 | 3.34 | 4.82 | 3.67 | 6.39 | 34.42 | 4.56
InternVL 3.0 | 128 | 35.73 | 29.64 | 24.96 | 43.73 | 3.92 | 4.13 | 3.63 | 6.17 | 33.52 | 4.46
Qwen2-VL | 768 | 25.79 | 31.02 | 35.91 | 43.07 | 2.25 | 4.90 | 3.29 | 6.01 | 33.95 | 4.11
Goldfish (Mistral) | 60 FPW | 17.55 | 23.67 | 23.99 | 39.37 | 3.00 | 5.42 | 3.69 | 6.45 | 26.15 | 4.64 |
Video-Flash | 1000 | 22.01 | 30.81 | 37.67 | 47.58 | 2.70 | 3.87 | 2.95 | 5.02 | 34.52 | 3.64 |
LLaVA-OneVision | 128 | 24.19 | 27.83 | 25.26 | 46.50 | 2.00 | 4.09 | 3.31 | 6.14 | 30.95 | 3.89
InternVL2 | 128 | 27.44 | 25.48 | 23.76 | 40.93 | 2.81 | 3.77 | 3.08 | 5.93 | 29.40 | 3.90 |
InternVL2.5 | 128 | 29.05 | 26.65 | 23.99 | 36.26 | 2.51 | 3.14 | 2.32 | 5.06 | 28.99 | 3.26 |
InternLM-XComposer | 16 FPW | 23.27 | 29.53 | 29.99 | 42.78 | 1.67 | 2.84 | 2.46 | 5.00 | 31.39 | 2.99 |
MiniGPT4-video (Mistral) | 60 | 18.49 | 25.16 | 28.49 | 41.06 | 2.81 | 3.11 | 3.08 | 3.87 | 28.30 | 3.22 |
LongVU | 512 | 26.59 | 21.86 | 23.76 | 37.07 | 1.71 | 3.23 | 2.98 | 4.09 | 27.32 | 3.00 |
InfiniBench leaderboard across the eight skills. The first four skill columns (Global Appearance through Chronological Understanding) are grounding skills, reported as MCQ accuracy (0-100); the last four (Summarization through Linking Events) are reasoning skills, reported as open-ended scores (0-10). Frame Rate is given in FPV (Frames Per Video) unless marked FPS (Frames Per Second) or FPW (Frames Per Window). All models in this evaluation use subtitles.
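For reference, the two rightmost columns are simple per-category means: Avg. Acc averages the four grounding-skill accuracies and Avg. Score averages the four reasoning-skill scores (e.g., for GPT-4o: (54.82 + 43.76 + 45.29 + 66.24) / 4 = 52.53). A minimal Python sketch, with variable and function names of our own choosing rather than the benchmark's released evaluation code:

```python
# Reconstructs the two aggregate columns from the per-skill columns:
# Avg. Acc is the plain mean of the four grounding-skill accuracies,
# Avg. Score the plain mean of the four reasoning-skill scores.
from statistics import mean

GROUNDING = ["global_appearance", "scene_transitions",
             "character_actions", "chronological_understanding"]
REASONING = ["summarization", "deep_context_understanding",
             "spoiler_understanding", "linking_events"]

def aggregate(row):
    """Return (avg accuracy 0-100, avg score 0-10) for one leaderboard row."""
    return mean(row[s] for s in GROUNDING), mean(row[s] for s in REASONING)

# GPT-4o row from the leaderboard above.
gpt4o = {"global_appearance": 54.82, "scene_transitions": 43.76,
         "character_actions": 45.29, "chronological_understanding": 66.24,
         "summarization": 6.35, "deep_context_understanding": 6.92,
         "spoiler_understanding": 4.01, "linking_events": 6.72}

avg_acc, avg_score = aggregate(gpt4o)
print(f"Avg. Acc = {avg_acc:.2f}, Avg. Score = {avg_score:.2f}")
# Prints approximately: Avg. Acc = 52.53, Avg. Score = 6.00
```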
InfiniBench skill statistics. (A) Number of questions per skill, (B) number of videos per skill, and (C) average video duration per skill.
Comparison between TV shows and movies. (A) Number of questions, (B) number of videos, (C) total video duration, and (D) minimum, maximum, and average video duration for each video source.
We only provide annotations for already existing video datasets, namely TVQA and MovieNet.
We only preprocess the videos and subtitles for these datasets, as described in the paper, to align with the benchmark requirements.
To make the benchmark easier to use, we have preprocessed the videos and subtitles for both TVQA and MovieNet, and you can download the preprocessed versions directly from the table below.
Split | Download link |
---|---|
Test (verified) | Videos + Annotations |
Train (not verified) | Videos + Annotations |
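Once extracted, the annotations can be inspected with standard tooling. A minimal sketch, assuming the archive unpacks into per-skill JSON files (the directory layout and file names below are placeholders):

```python
import json
from pathlib import Path

# Placeholder: point this at wherever the annotation archive was extracted.
ANN_DIR = Path("infinibench_annotations/test")

# Count questions per skill file, assuming one JSON list per skill.
counts = {}
for ann_file in sorted(ANN_DIR.glob("*.json")):
    with open(ann_file) as f:
        items = json.load(f)
    counts[ann_file.stem] = len(items)

for skill, n in counts.items():
    print(f"{skill}: {n} questions")
```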
Alternatively, you can download the original data and preprocess it using the scripts provided in this repository:
- View the video preprocessing instructions.
- View data_genration/README.md for the full annotation pipeline details.
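As a rough illustration of what the video preprocessing step involves (the repository's own scripts are the authoritative reference), frames can be sampled at a fixed rate with ffmpeg; the paths and sampling rate below are placeholders:

```python
import subprocess
from pathlib import Path

def extract_frames(video: Path, out_dir: Path, fps: float = 1.0) -> None:
    """Sample frames from `video` at `fps` frames per second via ffmpeg."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", str(video),
         "-vf", f"fps={fps}",                   # fixed-rate frame sampling
         str(out_dir / "frame_%06d.jpg")],
        check=True,
    )

# Placeholder invocation; real inputs come from the TVQA / MovieNet videos.
extract_frames(Path("episode.mp4"), Path("frames/episode"), fps=1.0)
```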
If you're using InfiniBench in your research or applications, please cite using this BibTeX:
```bibtex
@misc{ataallah2024infinibenchcomprehensivebenchmarklarge,
      title={InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding},
      author={Kirolos Ataallah and Chenhui Gou and Eslam Abdelrahman and Khushbu Pahwa and Jian Ding and Mohamed Elhoseiny},
      year={2024},
      eprint={2406.19875},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.19875},
}
```
This repository is licensed under the BSD 3-Clause License.