🚀 Quickstart | 🌐 Homepage | 🏆 Leaderboard | 🤗 IntJudge | 📖 OpenING arXiv | 🖊️ Citation
This repository is the official implementation of OpenING (CVPR 2025 Oral).
OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation
Pengfei Zhou*, Xiaopeng Peng*, Jiajun Song, Chuanhao Li, Zhaopan Xu, Yue Yang, Ziyao Guo, Hao Zhang, Yuqi Lin, Yefei He, Lirui Zhao, Shuo Liu, Tianhua Li, Yuxuan Xie, Xiaojun Chang, Yu Qiao, Wenqi Shao, Kaipeng Zhang†

\* Equal Contribution
† Corresponding Author: [email protected]
- **2025/02/27**: The beta version of OpenING data can be accessed via Google Drive. If you have any questions, please contact us.
- **2025/02/26**: Our paper is accepted by CVPR 2025 and selected as an Oral. Thanks to all contributors.
- **2024/11/29**: Our judge model IntJudge is released!
- **2024/11/28**: We are releasing the evaluation code here.
- **2024/11/27**: The technical report of OpenING is released! Also check out our project page!
We introduce OpenING, a comprehensive benchmark comprising 5,400 high-quality human-annotated instances across 56 real-world tasks. OpenING covers diverse daily scenarios such as travel guides, design, and brainstorming, offering a robust platform for challenging interleaved generation methods. In addition, we present IntJudge, a judge model for evaluating open-ended multimodal generation methods. Trained with a novel data pipeline, our IntJudge achieves an agreement rate of 82.42% with human judgments, outperforming GPT-based evaluators by 11.34%. Extensive experiments on OpenING reveal that current interleaved generation methods still have substantial room for improvement. Key findings on interleaved image-text generation are further presented to guide the development of next-generation generative models. We anticipate that more advanced multimodal judge models can be trained and tested on OpenING, and we believe that OpenING will push the boundaries of MLLMs toward general-purpose multimodal intelligence.
- An overview of model win rates evaluated by human, GPT-4o, and our IntJudge under FDT and different tie metrics. FDT: Force Dividing Tie metric. w/o Tie: non-tie cases only. w/ Tie (0) and w/ Tie (.5): count a tie as 0 and 0.5 wins for a model in a pairwise comparison, respectively. The best-performing model in each category is shown in bold, and the second best is underlined.
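To make the tie conventions concrete, here is a minimal sketch of how a win rate could be computed from a list of pairwise outcomes under the three tie treatments above. This is an illustrative helper, not the official evaluation script, and it does not implement FDT, whose tie-dividing mechanism is described in the paper rather than here.

```python
from typing import List

def win_rate(outcomes: List[str], tie_mode: str = "w/ Tie (.5)") -> float:
    """Win rate for one model over a list of pairwise outcomes.

    Each entry of `outcomes` is "win", "lose", or "tie".
      - "w/o Tie":     drop tied comparisons entirely
      - "w/ Tie (0)":  count a tie as 0 wins
      - "w/ Tie (.5)": count a tie as 0.5 wins
    """
    wins = outcomes.count("win")
    ties = outcomes.count("tie")
    total = len(outcomes)
    if tie_mode == "w/o Tie":
        non_tie = total - ties
        return wins / non_tie if non_tie else 0.0
    if tie_mode == "w/ Tie (0)":
        return wins / total
    if tie_mode == "w/ Tie (.5)":
        return (wins + 0.5 * ties) / total
    raise ValueError(f"unknown tie mode: {tie_mode!r}")

results = ["win", "tie", "lose", "win", "tie"]
print(win_rate(results, "w/o Tie"))      # 2 wins / 3 non-tie comparisons
print(win_rate(results, "w/ Tie (0)"))   # 2 / 5 = 0.4
print(win_rate(results, "w/ Tie (.5)"))  # (2 + 1) / 5 = 0.6
```

Note how the choice of tie treatment alone moves the reported win rate, which is why the leaderboard reports all of them side by side.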
Please refer to this to view the dynamic leaderboard.
Please refer to this for a quick start.
The guidelines for the annotators emphasized strict compliance with copyright and licensing rules from the initial data source, specifically avoiding materials from websites that forbid copying and redistribution. Should you encounter any data samples potentially breaching the copyright or licensing regulations of any site, please feel free to contact us. Upon verification, we will immediately remove the potentially breaching samples.
- Pengfei Zhou: [email protected]
- Kaipeng Zhang: [email protected]
If you find OpenING useful in your project or research, please use the following BibTeX entry to cite our paper. Thanks!
@misc{zhou2024GATE,
title={GATE OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation},
author={Pengfei Zhou and Xiaopeng Peng and Jiajun Song and Chuanhao Li and Zhaopan Xu and Yue Yang and Ziyao Guo and Hao Zhang and Yuqi Lin and Yefei He and Lirui Zhao and Shuo Liu and Tianhua Li and Yuxuan Xie and Xiaojun Chang and Yu Qiao and Wenqi Shao and Kaipeng Zhang},
year={2024},
eprint={2411.18499},
archivePrefix={arXiv},
primaryClass={cs.CV}
}