- [2025-06-10] A new version of InfiniBench has been released.
The InfiniBench skill set, comprising eight skills. The right side shows the skill categories and question types, while the left side provides examples of both multiple-choice (MCQ) and open-ended questions.
Understanding long-form videos, such as movies and TV episodes ranging from tens of minutes to two hours, remains a major challenge for multi-modal models. Existing benchmarks often fall short in testing the full range of cognitive skills needed to process these temporally rich and narratively complex inputs. We introduce InfiniBench, a comprehensive benchmark designed to rigorously evaluate the capabilities of models in long video understanding. InfiniBench offers: (1) over 1,000 hours of video content, with an average video length of 52.59 minutes; (2) the largest set of question-answer pairs for long video comprehension, totaling around 91K; (3) eight diverse skills that span both grounding-based (e.g., scene transitions, character actions) and reasoning-based (e.g., deep context, multi-event linking) understanding; and (4) rich annotation formats, including both multiple-choice and open-ended questions. We conduct an in-depth evaluation across both commercial (GPT-4o, Gemini 1.5 Flash) and open-source (Qwen2.5-VL, InternVL2.5) vision-language models. Results reveal that current models remain far from solving long video understanding: on grounding-based skills, the top open-source model (Qwen2.5-VL) and GPT-4o achieve only 39.4% and 48.1% accuracy, respectively. Interestingly, several models achieve non-trivial performance using only the movie or episode title, without watching the video, revealing a reliance on pre-trained world knowledge that partially compensates for the absence of visual or temporal understanding. These findings highlight critical gaps in current approaches and underscore the need for models that truly engage with long visual narratives.
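For concreteness, each benchmark item pairs a long video with either an MCQ or an open-ended question under one of the eight skills. The snippet below is a hypothetical illustration of what such items look like; the field names and paths are ours, not the exact released schema:

```python
# Hypothetical InfiniBench-style items (illustrative only; consult the
# released annotation files for the exact schema and field names).
mcq_item = {
    "video": "tvqa/friends_s01e01.mp4",    # placeholder path
    "skill": "scene_transitions",          # one of the 8 skills
    "question": "Which location does the story cut to after the cafe scene?",
    "options": ["A) The apartment", "B) The office",
                "C) The street", "D) The museum"],
    "answer": "A",
}
open_ended_item = {
    "video": "movienet/tt0111161.mp4",     # placeholder path
    "skill": "deep_context_understanding",
    "question": "Why does the protagonist conceal the letter from his friend?",
    "answer": "Free-form reference answer, graded on the 0-10 scale "
              "reported in the leaderboard.",
}
```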
Models | Frame Rate | Global Appearance | Scene Transitions | Character Actions | Chronological Understanding | Summarization | Deep Context Understanding | Spoiler Understanding | Linking Events | Avg. Acc (0-100) | Avg. Score (0-10)
---|---|---|---|---|---|---|---|---|---|---|---
Baseline Random | -- | 19.96 | 19.77 | 18.41 | 36.45 | -- | -- | -- | -- | 23.65 | -- |
GPT-4o | 450 | 54.82 | 43.76 | 45.29 | 66.24 | 6.35 | 6.92 | 4.01 | 6.72 | 52.53 | 6.00 |
Gemini Flash 2.0 | 1 FPS | 49.06 | 45.14 | 57.67 | 55.80 | 5.81 | 6.27 | 3.97 | 6.38 | 51.92 | 5.61 |
Qwen2.5-VL | 768 | 33.16 | 29.85 | 29.31 | 45.37 | 3.34 | 4.82 | 3.67 | 6.39 | 34.42 | 4.56
InternVL 3.0 | 128 | 35.73 | 29.64 | 24.96 | 43.73 | 3.92 | 4.13 | 3.63 | 6.17 | 33.52 | 4.46
Qwen2-VL | 768 | 25.79 | 31.02 | 35.91 | 43.07 | 2.25 | 4.90 | 3.29 | 6.01 | 33.95 | 4.11
Goldfish (Mistral) | 60 FPW | 17.55 | 23.67 | 23.99 | 39.37 | 3.00 | 5.42 | 3.69 | 6.45 | 26.15 | 4.64 |
Video-Flash | 1000 | 22.01 | 30.81 | 37.67 | 47.58 | 2.70 | 3.87 | 2.95 | 5.02 | 34.52 | 3.64 |
LLaVA-OneVision | 128 | 24.19 | 27.83 | 25.26 | 46.50 | 2.00 | 4.09 | 3.31 | 6.14 | 30.95 | 3.89
InternVL2 | 128 | 27.44 | 25.48 | 23.76 | 40.93 | 2.81 | 3.77 | 3.08 | 5.93 | 29.40 | 3.90 |
InternVL2.5 | 128 | 29.05 | 26.65 | 23.99 | 36.26 | 2.51 | 3.14 | 2.32 | 5.06 | 28.99 | 3.26 |
InternLM-XComposer | 16 FPW | 23.27 | 29.53 | 29.99 | 42.78 | 1.67 | 2.84 | 2.46 | 5.00 | 31.39 | 2.99 |
MiniGPT4-video (Mistral) | 60 | 18.49 | 25.16 | 28.49 | 41.06 | 2.81 | 3.11 | 3.08 | 3.87 | 28.30 | 3.22 |
LongVU | 512 | 26.59 | 21.86 | 23.76 | 37.07 | 1.71 | 3.23 | 2.98 | 4.09 | 27.32 | 3.00 |
InfiniBench leaderboard across the eight skills. The first four skill columns (Global Appearance through Chronological Understanding) are grounding skills, reported as MCQ accuracy (0-100); the last four (Summarization through Linking Events) are reasoning skills, reported as open-ended scores (0-10). Frame Rate is given in FPV (Frames Per Video) unless marked FPS (Frames Per Second) or FPW (Frames Per Window). All models in this evaluation use subtitles.
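For reference, the two rightmost columns are simple per-category means: Avg. Acc averages the four grounding-skill accuracies and Avg. Score averages the four reasoning-skill scores (e.g., for GPT-4o: (54.82 + 43.76 + 45.29 + 66.24) / 4 = 52.53). A minimal Python sketch, with variable and function names of our own choosing rather than the benchmark's released evaluation code:

```python
# Reconstructs the two aggregate columns from the per-skill columns:
# Avg. Acc is the plain mean of the four grounding-skill accuracies,
# Avg. Score the plain mean of the four reasoning-skill scores.
from statistics import mean

GROUNDING = ["global_appearance", "scene_transitions",
             "character_actions", "chronological_understanding"]
REASONING = ["summarization", "deep_context_understanding",
             "spoiler_understanding", "linking_events"]

def aggregate(row):
    """Return (avg accuracy 0-100, avg score 0-10) for one leaderboard row."""
    return mean(row[s] for s in GROUNDING), mean(row[s] for s in REASONING)

# GPT-4o row from the leaderboard above.
gpt4o = {"global_appearance": 54.82, "scene_transitions": 43.76,
         "character_actions": 45.29, "chronological_understanding": 66.24,
         "summarization": 6.35, "deep_context_understanding": 6.92,
         "spoiler_understanding": 4.01, "linking_events": 6.72}

avg_acc, avg_score = aggregate(gpt4o)
print(f"Avg. Acc = {avg_acc:.2f}, Avg. Score = {avg_score:.2f}")
# Prints approximately: Avg. Acc = 52.53, Avg. Score = 6.00
```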
InfiniBench skill statistics. (A) Number of questions per skill, (B) number of videos per skill, and (C) average video duration per skill.
Comparison between TV shows and movies. (A) Number of questions, (B) number of videos, (C) total video duration, and (D) minimum, maximum, and average video duration for each video source.
We only provide annotations for already existing video datasets, namely TVQA and MovieNet.
We only preprocess the videos and subtitles for these datasets, as described in the paper, to align with the benchmark requirements.
To make the benchmark easier to use, we have preprocessed the videos and subtitles for both TVQA and MovieNet, and you can download the preprocessed versions directly from the table below.
Split | Download link |
---|---|
Test (verified) | Videos + Annotations |
Train (not verified) | Videos + Annotations |
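Once extracted, the annotations can be inspected with standard tooling. A minimal sketch, assuming the archive unpacks into per-skill JSON files (the directory layout and file names below are placeholders):

```python
import json
from pathlib import Path

# Placeholder: point this at wherever the annotation archive was extracted.
ANN_DIR = Path("infinibench_annotations/test")

# Count questions per skill file, assuming one JSON list per skill.
counts = {}
for ann_file in sorted(ANN_DIR.glob("*.json")):
    with open(ann_file) as f:
        items = json.load(f)
    counts[ann_file.stem] = len(items)

for skill, n in counts.items():
    print(f"{skill}: {n} questions")
```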
Alternatively, you can download the original data and preprocess it using the scripts provided in this repository:
- View the video preprocessing instructions.
- View data_genration/README.md for the full annotation pipeline details.
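As a rough illustration of what the video preprocessing step involves (the repository's own scripts are the authoritative reference), frames can be sampled at a fixed rate with ffmpeg; the paths and sampling rate below are placeholders:

```python
import subprocess
from pathlib import Path

def extract_frames(video: Path, out_dir: Path, fps: float = 1.0) -> None:
    """Sample frames from `video` at `fps` frames per second via ffmpeg."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", str(video),
         "-vf", f"fps={fps}",                   # fixed-rate frame sampling
         str(out_dir / "frame_%06d.jpg")],
        check=True,
    )

# Placeholder invocation; real inputs come from the TVQA / MovieNet videos.
extract_frames(Path("episode.mp4"), Path("frames/episode"), fps=1.0)
```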
If you're using InfiniBench in your research or applications, please cite using this BibTeX:
```bibtex
@misc{ataallah2024infinibenchcomprehensivebenchmarklarge,
      title={InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding},
      author={Kirolos Ataallah and Chenhui Gou and Eslam Abdelrahman and Khushbu Pahwa and Jian Ding and Mohamed Elhoseiny},
      year={2024},
      eprint={2406.19875},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.19875},
}
```
This repository is licensed under the BSD 3-Clause License.