Vision-CAIR/Infinibench

InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows

🔥 News

  • [2025-06-10] A new version of InfiniBench has been released.

Overview:

Figure (InfiniBench teaser): The InfiniBench skill set comprises eight skills. The right side represents skill categories and question types, while the left side provides examples of both multiple-choice (MCQ) and open-ended questions.

Understanding long-form videos, such as movies and TV episodes ranging from tens of minutes to two hours, remains a major challenge for multi-modal models. Existing benchmarks often fall short in testing the full range of cognitive skills needed to process these temporally rich and narratively complex inputs. We introduce InfiniBench, a comprehensive benchmark designed to rigorously evaluate the capabilities of models in long video understanding. InfiniBench offers: (1) over 1,000 hours of video content, with an average video length of 52.59 minutes; (2) the largest set of question-answer pairs for long video comprehension, totaling around 91K; (3) eight diverse skills that span both grounding-based (e.g., scene transitions, character actions) and reasoning-based (e.g., deep context, multi-event linking) understanding; and (4) rich annotation formats, including both multiple-choice and open-ended questions.

We conduct an in-depth evaluation across both commercial (GPT-4o, Gemini 1.5 Flash) and open-source (Qwen2.5-VL, InternVL2.5) vision-language models. Results reveal that current models remain far from solving long video understanding: on grounding-based skills, the top open-source model (Qwen2.5-VL) and GPT-4o achieve only 39.4% and 48.1% accuracy, respectively. Interestingly, several models achieve non-trivial performance using only the movie or episode title, without watching the video, revealing a reliance on pre-trained world knowledge that partially compensates for the absence of visual or temporal understanding. These findings highlight critical gaps in current approaches and underscore the need for models that truly engage with long visual narratives.

🏆 InfiniBench Leaderboard (test verified):

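| Models | Frame Rate | Grounding: Global Appearance | Grounding: Scene Transitions | Grounding: Character Actions | Grounding: Chronological Understanding | Reasoning: Summarization | Reasoning: Deep Context Understanding | Reasoning: Spoiler Understanding | Reasoning: Linking Events | Avg. Acc (0-100) | Avg. Score (0-10) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline Random | -- | 19.96 | 19.77 | 18.41 | 36.45 | -- | -- | -- | -- | 23.65 | -- |
| GPT-4o | 450 | 54.82 | 43.76 | 45.29 | 66.24 | 6.35 | 6.92 | 4.01 | 6.72 | 52.53 | 6.00 |
| Gemini Flash 2.0 | 1 FPS | 49.06 | 45.14 | 57.67 | 55.80 | 5.81 | 6.27 | 3.97 | 6.38 | 51.92 | 5.61 |
| Qwen2.5VL | 768 | 33.16 | 29.85 | 29.31 | 45.37 | 3.34 | 4.82 | 3.67 | 6.39 | 34.42 | 4.56 |
| Intern VL 3.0 | 128 | 35.73 | 29.64 | 24.96 | 43.73 | 3.92 | 4.13 | 3.63 | 6.17 | 33.52 | 4.46 |
| Qwen2VL | 768 | 25.79 | 31.02 | 35.91 | 43.07 | 2.25 | 4.90 | 3.29 | 6.01 | 33.95 | 4.11 |
| Goldfish (Mistral) | 60 FPW | 17.55 | 23.67 | 23.99 | 39.37 | 3.00 | 5.42 | 3.69 | 6.45 | 26.15 | 4.64 |
| Video-Flash | 1000 | 22.01 | 30.81 | 37.67 | 47.58 | 2.70 | 3.87 | 2.95 | 5.02 | 34.52 | 3.64 |
| LLava-Onevision | 128 | 24.19 | 27.83 | 25.26 | 46.50 | 2.00 | 4.09 | 3.31 | 6.14 | 30.95 | 3.89 |
| InternVL2 | 128 | 27.44 | 25.48 | 23.76 | 40.93 | 2.81 | 3.77 | 3.08 | 5.93 | 29.40 | 3.90 |
| InternVL2.5 | 128 | 29.05 | 26.65 | 23.99 | 36.26 | 2.51 | 3.14 | 2.32 | 5.06 | 28.99 | 3.26 |
| InternLM-XComposer | 16 FPW | 23.27 | 29.53 | 29.99 | 42.78 | 1.67 | 2.84 | 2.46 | 5.00 | 31.39 | 2.99 |
| MiniGPT4-video (Mistral) | 60 | 18.49 | 25.16 | 28.49 | 41.06 | 2.81 | 3.11 | 3.08 | 3.87 | 28.30 | 3.22 |
| LongVU | 512 | 26.59 | 21.86 | 23.76 | 37.07 | 1.71 | 3.23 | 2.98 | 4.09 | 27.32 | 3.00 |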

InfiniBench leaderboard across eight skills. FPV (Frames Per Video), FPS (Frames Per Second), and FPW (Frames Per Window) are reported. All models in this evaluation utilize subtitles.
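
The two summary columns appear to be unweighted means of the four grounding accuracies and the four reasoning scores, respectively. This is an inference from the reported numbers, not something the repository states. A minimal Python sketch that checks it for the GPT-4o row (the helper name `unweighted_average` is hypothetical and not part of the InfiniBench code):

```python
from statistics import mean

# Per-skill results copied from the GPT-4o row of the leaderboard above.
gpt4o_grounding = {
    "Global Appearance": 54.82,
    "Scene Transitions": 43.76,
    "Character Actions": 45.29,
    "Chronological Understanding": 66.24,
}
gpt4o_reasoning = {
    "Summarization": 6.35,
    "Deep Context Understanding": 6.92,
    "Spoiler Understanding": 4.01,
    "Linking Events": 6.72,
}


def unweighted_average(per_skill):
    """Assumed aggregation: plain mean over skills, ignoring question counts."""
    return mean(per_skill.values())


avg_acc = unweighted_average(gpt4o_grounding)
avg_score = unweighted_average(gpt4o_reasoning)

# Compare against the reported Avg. Acc (52.53) and Avg. Score (6.00).
print(f"Avg. Acc: {avg_acc:.2f}  Avg. Score: {avg_score:.2f}")
```

If the averages were instead weighted by the number of questions per skill, the result would generally differ from the reported columns, so the unweighted mean seems the more likely reading.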

📊 Benchmark statistics:

Skills statistics:

Figure: InfiniBench skills statistics. (A) Number of questions per skill, (B) number of videos per skill, and (C) average video duration per skill.

Videos source statistics:


Figure: Comparison between TV shows and movies. (A) shows the number of questions, (B) the number of videos, (C) the total video duration, and (D) the minimum, maximum, and average video duration for each video source.

⬇️ Download The Benchmark

We only provide annotations for existing video datasets, namely TVQA and MovieNet.
We only preprocess the videos and subtitles of these datasets, as described in the paper, to align them with the benchmark requirements.
To make the benchmark easier to use, we have preprocessed the videos and subtitles for both TVQA and MovieNet, and you can download the preprocessed version directly from the table below.

| Split | Download link |
|---|---|
| Test (verified) | Videos + Annotations |
| Train (not verified) | Videos + Annotations |

OR

You can download the original data and preprocess it using the scripts provided in this repository.
See: Videos preprocessing

💡 Benchmark Examples

Click to expand more examples

Benchmark annotations pipeline

See data_genration/README.md for the full details of the annotation pipeline.

Citation

If you're using InfiniBench in your research or applications, please cite using this BibTeX:

@misc{ataallah2024infinibenchcomprehensivebenchmarklarge,
      title={InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding}, 
      author={Kirolos Ataallah and Chenhui Gou and Eslam Abdelrahman and Khushbu Pahwa and Jian Ding and Mohamed Elhoseiny},
      year={2024},
      eprint={2406.19875},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.19875}, 
}

License

This repository is released under the BSD 3-Clause License.
