
Commit aa2b89b

[Update] Add CascadeEvaluator with Data Replica (#2022)
* Update CascadeEvaluator * Update CascadeEvaluator * Update CascadeEvaluator * Update Config * Update * Update * Update * Update * Update * Update * Update * Update * Update * Update * Update * Update * Update * Update * Update
1 parent 7a7a451 commit aa2b89b

43 files changed, +1474 −272 lines

README.md

Lines changed: 2 additions & 2 deletions

@@ -60,7 +60,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
 - **\[2025.04.01\]** OpenCompass now supports `CascadeEvaluator`, a flexible evaluation mechanism that allows multiple evaluators to work in sequence. This enables creating customized evaluation pipelines for complex assessment scenarios. Check out the [documentation](docs/en/advanced_guides/llm_judge.md) for more details! 🔥🔥🔥
 - **\[2025.03.11\]** We have supported evaluation for `SuperGPQA` which is a great benchmark for measuring LLM knowledge ability 🔥🔥🔥
 - **\[2025.02.28\]** We have added a tutorial for `DeepSeek-R1` series model, please check [Evaluating Reasoning Model](docs/en/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
-- **\[2025.02.15\]** We have added two powerful evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluations and `MATHEvaluator` for mathematical reasoning assessments. Check out the documentation for [LLM Judge](docs/en/advanced_guides/llm_judge.md) and [Math Evaluation](docs/en/advanced_guides/general_math.md) for more details! 🔥🔥🔥
+- **\[2025.02.15\]** We have added two powerful evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluations and `MATHVerifyEvaluator` for mathematical reasoning assessments. Check out the documentation for [LLM Judge](docs/en/advanced_guides/llm_judge.md) and [Math Evaluation](docs/en/advanced_guides/general_math.md) for more details! 🔥🔥🔥
 - **\[2025.01.16\]** We now support the [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) model which has enhanced performance on reasoning and knowledge-intensive tasks.
 - **\[2024.12.17\]** We have provided the evaluation script for the December [CompassAcademic](examples/eval_academic_leaderboard_202412.py), which allows users to easily reproduce the official evaluation results by configuring it.
 - **\[2024.11.14\]** OpenCompass now offers support for a sophisticated benchmark designed to evaluate complex reasoning skills — [MuSR](https://arxiv.org/pdf/2310.16049). Check out the [demo](examples/eval_musr.py) and give it a spin! 🔥🔥🔥
@@ -246,7 +246,7 @@ Currently, OpenCompass have provided standard recommended configurations for dat
 opencompass --datasets aime2024_gen --models hf_internlm2_5_1_8b_chat

 # Recommended Evaluation Config based on LLM Judge
-opencompass --datasets aime2024_llm_judge_gen --models hf_internlm2_5_1_8b_chat
+opencompass --datasets aime2024_llmjudge_gen --models hf_internlm2_5_1_8b_chat
 ```

 If you want to use multiple GPUs to evaluate the model in data parallel, you can use `--max-num-worker`.
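As context for the hunk above: the `--max-num-worker` flag mentioned in the last context line can be combined with the recommended LLM-judge command. A hypothetical invocation (the worker count is arbitrary and not taken from this commit) would be:

    opencompass --datasets aime2024_llmjudge_gen --models hf_internlm2_5_1_8b_chat --max-num-worker 8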

README_zh-CN.md

Lines changed: 2 additions & 2 deletions

@@ -60,7 +60,7 @@
 - **\[2025.04.01\]** OpenCompass now supports `CascadeEvaluator`, which lets multiple evaluators work in sequence so that custom evaluation pipelines can be built for more complex assessment scenarios. See the [documentation](docs/zh_cn/advanced_guides/llm_judge.md) for usage details! 🔥🔥🔥
 - **\[2025.03.11\]** `SuperGPQA`, a knowledge benchmark covering 285 graduate-level disciplines, is now supported. Give it a try! 🔥🔥🔥
 - **\[2025.02.28\]** We have added a tutorial for the `DeepSeek-R1` series models; see [Evaluating Reasoning Models](docs/zh_cn/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
-- **\[2025.02.15\]** We have added two practical evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluation and `MATHEvaluator` for mathematical reasoning assessment. See the [LLM Judge](docs/zh_cn/advanced_guides/llm_judge.md) and [Math Evaluation](docs/zh_cn/advanced_guides/general_math.md) documentation for more details! 🔥🔥🔥
+- **\[2025.02.15\]** We have added two practical evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluation and `MATHVerifyEvaluator` for mathematical reasoning assessment. See the [LLM Judge](docs/zh_cn/advanced_guides/llm_judge.md) and [Math Evaluation](docs/zh_cn/advanced_guides/general_math.md) documentation for more details! 🔥🔥🔥
 - **\[2025.01.16\]** We now support the [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) model, which achieves best-in-class performance on reasoning and knowledge-intensive tasks. Give it a try.
 - **\[2024.12.17\]** We have provided the evaluation script for the December CompassAcademic academic leaderboard, [CompassAcademic](configs/eval_academic_leaderboard_202412.py); the official results can be reproduced with a simple configuration.
 - **\[2024.10.14\]** The OpenAI multilingual QA dataset [MMMLU](https://huggingface.co/datasets/openai/MMMLU) is now supported. Give it a try! 🔥🔥🔥
@@ -237,7 +237,7 @@ humaneval, triviaqa, commonsenseqa, tydiqa, strategyqa, cmmlu, lambada, piqa, ce
 opencompass --datasets aime2024_gen --models hf_internlm2_5_1_8b_chat

 # Recommended evaluation config based on LLM Judge
-opencompass --datasets aime2024_llm_judge_gen --models hf_internlm2_5_1_8b_chat
+opencompass --datasets aime2024_llmjudge_gen --models hf_internlm2_5_1_8b_chat
 ```

 In addition, if you want to run model inference on multiple GPUs, you can use the `--max-num-worker` argument.

dataset-index.yml

Lines changed: 1 addition & 1 deletion

@@ -303,7 +303,7 @@
     category: Examination
     paper: https://huggingface.co/datasets/Maxwell-Jia/AIME_2024
     configpath: opencompass/configs/datasets/aime2024/aime2024_gen.py
-    configpath_llmjudge: opencompass/configs/datasets/aime2024/aime2024_llm_judge_gen.py
+    configpath_llmjudge: opencompass/configs/datasets/aime2024/aime2024_llmjudge_gen.py
 - anli:
     name: Adversarial NLI
     category: Reasoning

docs/en/advanced_guides/llm_judge.md

Lines changed: 1 addition & 1 deletion

@@ -278,7 +278,7 @@ Here's an example of how to configure the CascadeEvaluator:

 ```python
 # Define a rule-based evaluator
-rule_evaluator = dict(type=MATHEvaluator)
+rule_evaluator = dict(type=MATHVerifyEvaluator)

 # Define an LLM judge evaluator
 llm_judge_evaluator = dict(
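The hunk ends mid-snippet; in the full document the two evaluators are combined into a `CascadeEvaluator`. A minimal sketch of that combination, assuming only the keyword names visible in examples/eval_cascade_evaluator.py later in this commit (any further options are not shown here):

    # Sketch, not part of the diff: wire the rule-based evaluator and the
    # LLM judge evaluator defined above into a single cascade evaluator.
    cascade_evaluator = dict(
        type=CascadeEvaluator,
        rule_evaluator=rule_evaluator,
        llm_evaluator=llm_judge_evaluator,
    )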

docs/en/advanced_guides/math_verify.md

Lines changed: 7 additions & 7 deletions

@@ -2,7 +2,7 @@

 ## Introduction

-Mathematical reasoning is a crucial capability for large language models (LLMs). To evaluate a model's mathematical abilities, we need to test its capability to solve mathematical problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHEvaluator components.
+Mathematical reasoning is a crucial capability for large language models (LLMs). To evaluate a model's mathematical abilities, we need to test its capability to solve mathematical problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHVerifyEvaluator components.

 ## Dataset Format

@@ -61,7 +61,7 @@ math_infer_cfg = dict(

 ```python
 math_eval_cfg = dict(
-    evaluator=dict(type=MATHEvaluator),
+    evaluator=dict(type=MATHVerifyEvaluator),
 )
 ```

@@ -86,11 +86,11 @@ math_datasets = [
 ]
 ```

-## MATHEvaluator
+## MATHVerifyEvaluator

-The MATHEvaluator is specifically designed to evaluate mathematical answers. It is developed based on the math_verify library, which provides mathematical expression parsing and verification capabilities, supporting extraction and equivalence verification for both LaTeX and general expressions.
+The MATHVerifyEvaluator is specifically designed to evaluate mathematical answers. It is developed based on the math_verify library, which provides mathematical expression parsing and verification capabilities, supporting extraction and equivalence verification for both LaTeX and general expressions.

-The MATHEvaluator implements:
+The MATHVerifyEvaluator implements:

 1. Extracts answers from both predictions and references using LaTeX extraction
 2. Handles various LaTeX formats and environments
@@ -133,7 +133,7 @@ Here's a complete example of how to set up math evaluation:
 from mmengine.config import read_base
 from opencompass.models import TurboMindModelwithChatTemplate
 from opencompass.datasets import CustomDataset
-from opencompass.openicl.icl_evaluator.math_evaluator import MATHEvaluator
+from opencompass.openicl.icl_evaluator.math_evaluator import MATHVerifyEvaluator
 from opencompass.openicl.icl_prompt_template import PromptTemplate
 from opencompass.openicl.icl_retriever import ZeroRetriever
 from opencompass.openicl.icl_inferencer import GenInferencer
@@ -160,7 +160,7 @@ math_infer_cfg = dict(

 # Evaluation configuration
 math_eval_cfg = dict(
-    evaluator=dict(type=MATHEvaluator),
+    evaluator=dict(type=MATHVerifyEvaluator),
 )

 # Dataset configuration
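The dataset block that follows "# Dataset configuration" lies outside this hunk. A minimal sketch of what it typically looks like, with placeholder names, paths, and column names that are not taken from the commit:

    # Sketch only: a CustomDataset entry wired to the infer/eval configs above.
    math_datasets = [
        dict(
            type=CustomDataset,
            abbr='my-math-dataset',                  # placeholder name
            path='path/to/your/math_dataset.jsonl',  # placeholder path
            reader_cfg=dict(input_columns=['problem'], output_column='solution'),  # placeholder columns
            infer_cfg=math_infer_cfg,
            eval_cfg=math_eval_cfg,
        )
    ]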

docs/zh_cn/advanced_guides/llm_judge.md

Lines changed: 1 addition & 1 deletion

@@ -277,7 +277,7 @@ OpenCompass also provides the cascade evaluator `CascadeEvaluator`, which combines rule-based

 ```python
 # Define a rule-based evaluator
-rule_evaluator = dict(type=MATHEvaluator)
+rule_evaluator = dict(type=MATHVerifyEvaluator)

 # Define an LLM judge evaluator
 llm_judge_evaluator = dict(

docs/zh_cn/advanced_guides/math_verify.md

Lines changed: 7 additions & 7 deletions

@@ -2,7 +2,7 @@

 ## Introduction

-Mathematical reasoning is a key capability of large language models (LLMs). To evaluate a model's mathematical ability, we need to test whether it can solve math problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHEvaluator components.
+Mathematical reasoning is a key capability of large language models (LLMs). To evaluate a model's mathematical ability, we need to test whether it can solve math problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHVerifyEvaluator components.

 ## Dataset Format

@@ -61,7 +61,7 @@ math_infer_cfg = dict(

 ```python
 math_eval_cfg = dict(
-    evaluator=dict(type=MATHEvaluator),
+    evaluator=dict(type=MATHVerifyEvaluator),
 )
 ```

@@ -86,11 +86,11 @@ math_datasets = [
 ]
 ```

-## MATHEvaluator
+## MATHVerifyEvaluator

-MATHEvaluator is an evaluator designed specifically for checking mathematical answers. It is built on the math_verify library, which provides mathematical expression parsing and verification, supporting extraction and equivalence checking for both LaTeX and general expressions.
+MATHVerifyEvaluator is an evaluator designed specifically for checking mathematical answers. It is built on the math_verify library, which provides mathematical expression parsing and verification, supporting extraction and equivalence checking for both LaTeX and general expressions.

-MATHEvaluator implements:
+MATHVerifyEvaluator implements:

 1. Extracts answers from both predictions and references using LaTeX extraction
 2. Handles various LaTeX formats and environments
@@ -133,7 +133,7 @@ MATHEvaluator implements:
 from mmengine.config import read_base
 from opencompass.models import TurboMindModelwithChatTemplate
 from opencompass.datasets import CustomDataset
-from opencompass.openicl.icl_evaluator.math_evaluator import MATHEvaluator
+from opencompass.evaluator import MATHVerifyEvaluator
 from opencompass.openicl.icl_prompt_template import PromptTemplate
 from opencompass.openicl.icl_retriever import ZeroRetriever
 from opencompass.openicl.icl_inferencer import GenInferencer
@@ -160,7 +160,7 @@ math_infer_cfg = dict(

 # Evaluation configuration
 math_eval_cfg = dict(
-    evaluator=dict(type=MATHEvaluator),
+    evaluator=dict(type=MATHVerifyEvaluator),
 )

 # Dataset configuration

examples/eval_cascade_evaluator.py

Lines changed: 6 additions & 3 deletions

@@ -7,9 +7,12 @@
 from opencompass.openicl.icl_prompt_template import PromptTemplate
 from opencompass.openicl.icl_retriever import ZeroRetriever
 from opencompass.openicl.icl_inferencer import GenInferencer
-from opencompass.evaluator import GenericLLMEvaluator, CascadeEvaluator
+from opencompass.evaluator import (
+    GenericLLMEvaluator,
+    CascadeEvaluator,
+    MATHVerifyEvaluator,
+)
 from opencompass.datasets import generic_llmjudge_postprocess
-from opencompass.openicl.icl_evaluator import MATHEvaluator
 from opencompass.datasets import (
     MATHDataset,
     math_postprocess_v2,
@@ -94,7 +97,7 @@
     judge_cfg=dict(),
 )

-rule_evaluator =dict(type=MATHEvaluator)
+rule_evaluator =dict(type=MATHVerifyEvaluator)
 cascade_evaluator = dict(type=CascadeEvaluator,
                          llm_evaluator=llm_judge_evaluator,
                          rule_evaluator=rule_evaluator,
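The hunk stops inside the `cascade_evaluator` dict, so its remaining options are not shown here. As a rough sketch (assumed, not taken from the diff), the evaluator would then be referenced from the dataset's eval config:

    # Sketch only: attach the cascade evaluator to the dataset's eval_cfg.
    math_eval_cfg = dict(
        evaluator=cascade_evaluator,
    )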

examples/eval_qwen3.py

Lines changed: 142 additions & 0 deletions

@@ -0,0 +1,142 @@ (new file)

import os.path as osp
from opencompass.models import OpenAISDK
from mmengine.config import read_base
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
from opencompass.runners import LocalRunner
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask

with read_base():
    from opencompass.configs.datasets.aime2024.aime2024_cascade_eval_gen_5e9f4f import aime2024_datasets
    from opencompass.configs.datasets.aime2025.aime2025_cascade_eval_gen_5e9f4f import aime2025_datasets
    from opencompass.configs.datasets.math.math_500_cascade_eval_gen_6ff468 import math_datasets

#######################################################################
#                         PART 0  Meta Info                           #
#######################################################################

api_meta_template = dict(round=[
    dict(role='HUMAN', api_role='HUMAN'),
    dict(role='BOT', api_role='BOT', generate=True),
],
)

judge_cfg = dict(
    abbr='qwen2-5-32B-Instruct',
    type=OpenAISDK,
    path='Qwen/Qwen2.5-32B-Instruct',
    key='sk-1234',
    openai_api_base=[
        'http://x.x.x.x:4000/v1',
    ],
    meta_template=api_meta_template,
    query_per_second=8,
    batch_size=256,
    temperature=0.001,
    # max_completion_tokens=32768,
    tokenizer_path='gpt-4o-2024-05-13',
    # verbose=True,
    max_out_len=16384,
    max_seq_len=32768,
    # max_seq_len=49152,
    mode='mid',
    retry=10
)

#######################################################################
#                       PART 1  Datasets List                         #
#######################################################################

repeated_info = [
    (math_datasets, 4),
    (aime2024_datasets, 32),
    (aime2025_datasets, 32),
]

for datasets_, num in repeated_info:
    for dataset_ in datasets_:
        dataset_['n'] = num

datasets = sum(
    (v for k, v in locals().items() if k.endswith('_datasets')),
    [],
)

for item in datasets:
    item['infer_cfg']['inferencer']['max_out_len'] = 32768
    try:
        if 'judge_cfg' in item['eval_cfg']['evaluator']:
            item['eval_cfg']['evaluator']['judge_cfg'] = judge_cfg
        elif 'judge_cfg' in item['eval_cfg']['evaluator']['llm_evaluator']:
            item['eval_cfg']['evaluator']['llm_evaluator']['judge_cfg'] = judge_cfg
    except:
        pass

#######################################################################
#                     PART 2  Dataset Summarizer                      #
#######################################################################

summarizer = dict(
    dataset_abbrs=[
        'MATH',
        ['math_prm800k_500', 'accuracy (4 runs average)'],
        ['aime2024', 'accuracy (32 runs average)'],
        ['aime2025', 'accuracy (32 runs average)'],
        ['livemathbench_hard', 'naive_average'],
        ['OlympiadBenchMath', 'accuracy'],
        ['olymmath', 'naive_average'],
    ],
    summary_groups = sum(
        [v for k, v in locals().items() if k.endswith('_summary_groups')], []
    ),
)

#######################################################################
#                        PART 3  Models List                          #
#######################################################################

models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
models += [

    dict(
        abbr='Qwen_Qwen3-235B-A22B',
        type=OpenAISDK,
        path='Qwen/Qwen3-235B-A22B',
        key='sk-admin',
        openai_api_base=[
            'http://106.15.231.215:40007/v1/',
        ],
        meta_template=dict(
            # begin=dict(role='SYSTEM', api_role='SYSTEM', prompt=''),
            round=[
                dict(role='HUMAN', api_role='HUMAN'),
                # XXX: all system roles are mapped to human in purpose
                dict(role='BOT', api_role='BOT', generate=True),
            ]
        ),
        query_per_second=16,
        batch_size=128,
        # batch_size=1,
        temperature=0.6,
        # max_completion_tokens=32768,
        tokenizer_path='gpt-4',
        # verbose=True,
        max_out_len=32768,
        max_seq_len=32768,
        pred_postprocessor=dict(type=extract_non_reasoning_content)
    ),
]

infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)

eval = dict(
    partitioner=dict(type=NaivePartitioner, n=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLEvalTask)),
)

base_exp_dir = 'outputs/qwen3_reasoning'
work_dir = osp.join(base_exp_dir, 'chat_objective')
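This example is where the "Data Replica" part of the commit title shows up: setting `dataset_['n']` replicates each dataset so every problem is evaluated multiple times, and the summarizer pairs each dataset with the corresponding averaged metric, for example `aime2024` with `'accuracy (32 runs average)'` for `n = 32`. The new `--dataset-num-runs` CLI flag added in opencompass/cli/main.py below presumably exposes the same control from the command line, though the wiring is not visible in this diff.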

opencompass/cli/main.py

Lines changed: 19 additions & 4 deletions

@@ -12,8 +12,8 @@
 from opencompass.registry import PARTITIONERS, RUNNERS, build_from_cfg
 from opencompass.runners import SlurmRunner
 from opencompass.summarizers import DefaultSummarizer
-from opencompass.utils import (LarkReporter, get_logger, read_from_station,
-                               save_to_station)
+from opencompass.utils import (LarkReporter, get_logger, pretty_print_config,
+                               read_from_station, save_to_station)
 from opencompass.utils.run import (fill_eval_cfg, fill_infer_cfg,
                                    get_config_from_arg)

@@ -94,6 +94,11 @@ def parse_args():
                        help='Use the custom config directory instead of config/ to '
                        'search the configs for datasets, models and summarizers',
                        type=str)
+    parser.add_argument(
+        '--config-verbose',
+        default=False,
+        action='store_true',
+        help='Whether to print the config in verbose mode.')
     parser.add_argument('-l',
                         '--lark',
                         help='Report the running status to lark bot',
@@ -131,7 +136,7 @@ def parse_args():
                         'correctness of each sample, bpb, etc.',
                         action='store_true',
                         )
-
+    # for the results persistence
     parser.add_argument('-sp',
                         '--station-path',
                         help='Path to your results station.',
@@ -150,7 +155,12 @@ def parse_args():
                         'data station.',
                         action='store_true',
                         )
-
+    # for evaluation with multiple runs
+    parser.add_argument('--dataset-num-runs',
+                        help='How many runs for one dataset',
+                        type=int,
+                        default=1,
+                        )

     # set srun args
     slurm_parser = parser.add_argument_group('slurm_args')
@@ -299,6 +309,11 @@ def main():
        content = f'{getpass.getuser()}\'s task has been launched!'
        LarkReporter(cfg['lark_bot_url']).post(content)

+
+    # print config if specified --config-verbose
+    if args.config_verbose:
+        pretty_print_config(cfg)
+
    # infer
    if args.mode in ['all', 'infer']:
        # When user have specified --slurm or --dlc, or have not set
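Taken together, the new flags could be exercised roughly like this (an illustrative command only; the dataset and model names reuse the README examples above, and the flags' exact behaviour is only as described by their help strings):

    opencompass --datasets aime2024_llmjudge_gen --models hf_internlm2_5_1_8b_chat --dataset-num-runs 32 --config-verbose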
