FlagEval, launched by BAAI in 2023, is a comprehensive evaluation system for large models that has assessed over 800 open-source and closed-source models from around the globe. It covers more than 40 capability dimensions, including reasoning, mathematics, and task solving, organized into five major tasks and four categories of metrics.
| Project | Description | GitHub |
|---|---|---|
| FlagEval | General-purpose evaluation toolkit & platform for LLMs and multimodal foundation models; integrates >20 benchmarks across NLP, CV, and audio | https://github.com/flageval-baai/FlagEval |
| FlagEvalMM | Flexible framework for comprehensive multimodal model evaluation across text, image, and video tasks | https://github.com/flageval-baai/FlagEvalMM |
| SeniorTalk | 55 h Mandarin speech dataset featuring 202 elderly speakers (75-85 yrs) with rich annotations | https://github.com/flageval-baai/SeniorTalk |
| ChildMandarin | 41 h child speech dataset covering 397 speakers (3-5 yrs), balanced by gender & region | https://github.com/flageval-baai/ChildMandarin |
| HalluDial | Large-scale dialogue hallucination benchmark (spontaneous + induced scenarios, 147 k turns) | https://github.com/flageval-baai/HalluDial |
| CMMU | IJCAI-24 Chinese Multimodal Multi-type Question benchmark (3,603 exam-style Q&A) | https://github.com/flageval-baai/CMMU |
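To give a feel for how exam-style benchmarks such as CMMU are typically scored, here is a minimal accuracy-scoring sketch. The file layout and field names (`question_id`, `answer`, `predictions.json`, `cmmu_gold.json`) are assumptions for illustration only; use the official evaluation code in the CMMU repository for leaderboard-comparable numbers.

```python
import json
from pathlib import Path

def score_exam(pred_path: str, gold_path: str) -> float:
    """Simple accuracy for exam-style multiple-choice Q&A.

    Both files are assumed to be JSON lists of records keyed by a
    hypothetical "question_id", with the chosen/correct option letter
    stored under "answer" (e.g. "A"). The real CMMU format and its
    official scorer may differ.
    """
    preds = {r["question_id"]: r["answer"].strip().upper()
             for r in json.loads(Path(pred_path).read_text(encoding="utf-8"))}
    golds = {r["question_id"]: r["answer"].strip().upper()
             for r in json.loads(Path(gold_path).read_text(encoding="utf-8"))}

    # Count questions whose predicted option matches the gold option.
    correct = sum(1 for qid, gold in golds.items() if preds.get(qid) == gold)
    return correct / len(golds) if golds else 0.0

if __name__ == "__main__":
    print(f"accuracy = {score_exam('predictions.json', 'cmmu_gold.json'):.3f}")
```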
| Repo | Highlights | Why It Matters | License |
|---|---|---|---|
| FlagEval | NLP/CV/Audio/Multimodal tasks; pipeline runners, leaderboard exporter | One-stop hub for model & algorithm benchmarking | Apache-2.0 |
| FlagEvalMM | Multimodal eval harness with vLLM/SGLang adapters | Ready for the GPT-4o era; supports batch eval | Apache-2.0 |
| SeniorTalk | Elderly speech corpus | Enables ASR/TTS for the super-aged population | CC BY-NC-SA 4.0 |
| ChildMandarin | Child speech corpus | Complements SeniorTalk, spans the lifespan | CC BY-NC-SA 4.0 |
| HalluDial | Dialogue hallucination dataset & metrics | First large-scale hallucination localization benchmark | Apache-2.0 |
| CMMU | Multimodal Q&A exam | Stress-tests domain knowledge & reasoning | MIT |
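The vLLM/SGLang adapters highlighted above revolve around batched offline inference. The sketch below shows plain vLLM batch generation (a real vLLM API); the model name and prompts are placeholders, and this is not FlagEvalMM's actual adapter interface, which additionally handles prompt construction, multimodal inputs, and metric computation.

```python
# Minimal batched-inference sketch with vLLM, one of the engines FlagEvalMM can adapt to.
# Model id and prompts are illustrative placeholders.
from vllm import LLM, SamplingParams

prompts = [
    "Question: What is 12 * 7? Answer:",
    "Translate to English: 你好，世界。",
]

sampling_params = SamplingParams(temperature=0.0, max_tokens=64)
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # any Hugging Face model id

# vLLM processes the whole prompt list as one batch.
for output in llm.generate(prompts, sampling_params):
    print(output.prompt)
    print("->", output.outputs[0].text.strip())
```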
- Continuous Benchmarking: nightly runs on FlagScale with automated PR badges and regression alerts (a minimal regression-check sketch follows this list).
- Community Challenges: quarterly leaderboard sprints to surface emerging research directions.
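At its core, a regression alert just compares fresh nightly scores against a stored baseline. The sketch below illustrates that idea; the file names, score format, and threshold are assumptions for illustration, not the actual FlagScale CI configuration.

```python
import json
import sys

THRESHOLD = 0.01  # flag drops larger than one percentage point (illustrative value)

def check_regressions(baseline_path: str, current_path: str) -> list[str]:
    """Return human-readable alerts for benchmarks whose score dropped."""
    # Both files are assumed to map benchmark name -> score, e.g. {"mmlu": 0.712}.
    baseline = json.load(open(baseline_path, encoding="utf-8"))
    current = json.load(open(current_path, encoding="utf-8"))
    alerts = []
    for bench, base_score in baseline.items():
        new_score = current.get(bench)
        if new_score is not None and base_score - new_score > THRESHOLD:
            alerts.append(f"{bench}: {base_score:.3f} -> {new_score:.3f}")
    return alerts

if __name__ == "__main__":
    alerts = check_regressions("baseline_scores.json", "nightly_scores.json")
    for line in alerts:
        print("REGRESSION:", line)
    sys.exit(1 if alerts else 0)  # non-zero exit lets CI mark the nightly run as failed
```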
We welcome issues & PRs! Please check each project's CONTRIBUTING.md and adhere to its license terms.
If you use any component of the ecosystem, please cite the corresponding paper listed in that project’s README.
This meta‑repository is released under Apache‑2.0. Individual projects may apply different licenses—see their respective READMEs.
Maintained by the FlagEval team · Last updated: 2025‑04‑23