FlagEval evaluation platform


FlagEval, launched by BAAI in 2023, is a comprehensive evaluation system for large models that covers more than 800 open-source and closed-source models from around the world. It assesses more than 40 capability dimensions, including reasoning, mathematical skills, and task-solving abilities, across five major tasks and four categories of metrics.

🌟 FlagEval Core

| Project | Scope | GitHub |
| --- | --- | --- |
| FlagEval | General‑purpose evaluation toolkit & platform for LLMs and multimodal foundation models; integrates >20 benchmarks across NLP, CV, Audio | https://github.com/flageval-baai/FlagEval |

🚀 Satellite Repositories

| Project | Description | GitHub |
| --- | --- | --- |
| FlagEvalMM | Flexible framework for comprehensive multimodal model evaluation across text, image, and video tasks | https://github.com/flageval-baai/FlagEvalMM |
| SeniorTalk | 55 h Mandarin speech dataset featuring 202 elderly speakers (75‑85 yrs) with rich annotations | https://github.com/flageval-baai/SeniorTalk |
| ChildMandarin | 41 h child speech dataset covering 397 speakers (3‑5 yrs), balanced by gender & region | https://github.com/flageval-baai/ChildMandarin |
| HalluDial | Large‑scale dialogue hallucination benchmark (spontaneous + induced scenarios, 147 k turns) | https://github.com/flageval-baai/HalluDial |
| CMMU | IJCAI‑24 Chinese Multimodal Multi‑type Question benchmark (3 603 exam‑style Q&A) | https://github.com/flageval-baai/CMMU |

📚 Repository Matrix

| Repo | Highlights | Why It Matters | License |
| --- | --- | --- | --- |
| FlagEval | NLP/CV/Audio/Multimodal tasks; pipeline runners, leaderboard exporter | One‑stop hub for model & algorithm benchmarking | Apache‑2.0 |
| FlagEvalMM | Multimodal eval harness with vLLM/SGLang adapters | Ready for GPT‑4o era, supports batch eval | Apache‑2.0 |
| SeniorTalk | Elderly speech corpus | Enables ASR/TTS for super‑aged population | CC BY‑NC‑SA 4.0 |
| ChildMandarin | Child speech corpus | Complements SeniorTalk, spans lifespan | CC BY‑NC‑SA 4.0 |
| HalluDial | Dialogue hallucination dataset & metrics | First large‑scale hallucination localization benchmark | Apache‑2.0 |
| CMMU | Multimodal Q&A exam | Stress‑tests domain knowledge & reasoning | MIT |

🔭 Roadmap (2025‑2026)

  1. Continuous Benchmarking: nightly runs on FlagScale with automated PR badges and regression alerts.
  2. Community Challenges: quarterly leaderboard sprints to surface emerging research directions.

🤝 Contributing

We welcome issues & PRs! Please check each project’s CONTRIBUTING.md and adhere to its license terms.


📄 Citation

If you use any component of the ecosystem, please cite the corresponding paper listed in that project’s README.


🛡️ License

This meta‑repository is released under Apache‑2.0. Individual projects may apply different licenses—see their respective READMEs.


Maintained by the FlagEval team · Last updated: 2025‑04‑23
