
Commit aa2b89b

[Update] Add CascadeEvaluator with Data Replica (#2022)
* Update CascadeEvaluator * Update CascadeEvaluator * Update CascadeEvaluator * Update Config * Update * Update * Update * Update * Update * Update * Update * Update * Update * Update * Update * Update * Update * Update * Update
1 parent 7a7a451 commit aa2b89b

43 files changed, +1474 −272 lines

README.md

Lines changed: 2 additions & 2 deletions

@@ -60,7 +60,7 @@ Just like a compass guides us on our journey, OpenCompass will guide you through
 - **\[2025.04.01\]** OpenCompass now supports `CascadeEvaluator`, a flexible evaluation mechanism that allows multiple evaluators to work in sequence. This enables creating customized evaluation pipelines for complex assessment scenarios. Check out the [documentation](docs/en/advanced_guides/llm_judge.md) for more details! 🔥🔥🔥
 - **\[2025.03.11\]** We have supported evaluation for `SuperGPQA` which is a great benchmark for measuring LLM knowledge ability 🔥🔥🔥
 - **\[2025.02.28\]** We have added a tutorial for `DeepSeek-R1` series model, please check [Evaluating Reasoning Model](docs/en/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
-- **\[2025.02.15\]** We have added two powerful evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluations and `MATHEvaluator` for mathematical reasoning assessments. Check out the documentation for [LLM Judge](docs/en/advanced_guides/llm_judge.md) and [Math Evaluation](docs/en/advanced_guides/general_math.md) for more details! 🔥🔥🔥
+- **\[2025.02.15\]** We have added two powerful evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluations and `MATHVerifyEvaluator` for mathematical reasoning assessments. Check out the documentation for [LLM Judge](docs/en/advanced_guides/llm_judge.md) and [Math Evaluation](docs/en/advanced_guides/general_math.md) for more details! 🔥🔥🔥
 - **\[2025.01.16\]** We now support the [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) model which has enhanced performance on reasoning and knowledge-intensive tasks.
 - **\[2024.12.17\]** We have provided the evaluation script for the December [CompassAcademic](examples/eval_academic_leaderboard_202412.py), which allows users to easily reproduce the official evaluation results by configuring it.
 - **\[2024.11.14\]** OpenCompass now offers support for a sophisticated benchmark designed to evaluate complex reasoning skills — [MuSR](https://arxiv.org/pdf/2310.16049). Check out the [demo](examples/eval_musr.py) and give it a spin! 🔥🔥🔥
@@ -246,7 +246,7 @@ Currently, OpenCompass have provided standard recommended configurations for dat
 opencompass --datasets aime2024_gen --models hf_internlm2_5_1_8b_chat

 # Recommended Evaluation Config based on LLM Judge
-opencompass --datasets aime2024_llm_judge_gen --models hf_internlm2_5_1_8b_chat
+opencompass --datasets aime2024_llmjudge_gen --models hf_internlm2_5_1_8b_chat
 ```

 If you want to use multiple GPUs to evaluate the model in data parallel, you can use `--max-num-worker`.
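As context for the hunk above: the `--max-num-worker` flag mentioned in the last context line can be combined with the recommended LLM-judge command. A hypothetical invocation (the worker count is arbitrary and not taken from this commit) would be:

    opencompass --datasets aime2024_llmjudge_gen --models hf_internlm2_5_1_8b_chat --max-num-worker 8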

README_zh-CN.md

Lines changed: 2 additions & 2 deletions

@@ -60,7 +60,7 @@
 - **\[2025.04.01\]** OpenCompass now supports `CascadeEvaluator`, which lets multiple evaluators work in sequence so that custom evaluation pipelines can be built for more complex assessment scenarios. See the [documentation](docs/zh_cn/advanced_guides/llm_judge.md) for usage details! 🔥🔥🔥
 - **\[2025.03.11\]** `SuperGPQA`, a knowledge benchmark covering 285 graduate-level disciplines, is now supported. Give it a try! 🔥🔥🔥
 - **\[2025.02.28\]** We have added a tutorial for the `DeepSeek-R1` series models; see [Evaluating Reasoning Models](docs/zh_cn/user_guides/deepseek_r1.md) for more details! 🔥🔥🔥
-- **\[2025.02.15\]** We have added two practical evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluation and `MATHEvaluator` for mathematical reasoning assessment. See the [LLM Judge](docs/zh_cn/advanced_guides/llm_judge.md) and [Math Evaluation](docs/zh_cn/advanced_guides/general_math.md) documentation for more details! 🔥🔥🔥
+- **\[2025.02.15\]** We have added two practical evaluation tools: `GenericLLMEvaluator` for LLM-as-judge evaluation and `MATHVerifyEvaluator` for mathematical reasoning assessment. See the [LLM Judge](docs/zh_cn/advanced_guides/llm_judge.md) and [Math Evaluation](docs/zh_cn/advanced_guides/general_math.md) documentation for more details! 🔥🔥🔥
 - **\[2025.01.16\]** We now support the [InternLM3-8B-Instruct](https://huggingface.co/internlm/internlm3-8b-instruct) model, which achieves best-in-class performance on reasoning and knowledge-intensive tasks. Give it a try.
 - **\[2024.12.17\]** We have provided the evaluation script for the December CompassAcademic academic leaderboard, [CompassAcademic](configs/eval_academic_leaderboard_202412.py); the official results can be reproduced with a simple configuration.
 - **\[2024.10.14\]** The OpenAI multilingual QA dataset [MMMLU](https://huggingface.co/datasets/openai/MMMLU) is now supported. Give it a try! 🔥🔥🔥
@@ -237,7 +237,7 @@ humaneval, triviaqa, commonsenseqa, tydiqa, strategyqa, cmmlu, lambada, piqa, ce
 opencompass --datasets aime2024_gen --models hf_internlm2_5_1_8b_chat

 # Recommended evaluation config based on LLM Judge
-opencompass --datasets aime2024_llm_judge_gen --models hf_internlm2_5_1_8b_chat
+opencompass --datasets aime2024_llmjudge_gen --models hf_internlm2_5_1_8b_chat
 ```

 In addition, if you want to run model inference on multiple GPUs, you can use the `--max-num-worker` argument.

dataset-index.yml

Lines changed: 1 addition & 1 deletion

@@ -303,7 +303,7 @@
     category: Examination
     paper: https://huggingface.co/datasets/Maxwell-Jia/AIME_2024
     configpath: opencompass/configs/datasets/aime2024/aime2024_gen.py
-    configpath_llmjudge: opencompass/configs/datasets/aime2024/aime2024_llm_judge_gen.py
+    configpath_llmjudge: opencompass/configs/datasets/aime2024/aime2024_llmjudge_gen.py
 - anli:
     name: Adversarial NLI
     category: Reasoning

docs/en/advanced_guides/llm_judge.md

Lines changed: 1 addition & 1 deletion

@@ -278,7 +278,7 @@ Here's an example of how to configure the CascadeEvaluator:

 ```python
 # Define a rule-based evaluator
-rule_evaluator = dict(type=MATHEvaluator)
+rule_evaluator = dict(type=MATHVerifyEvaluator)

 # Define an LLM judge evaluator
 llm_judge_evaluator = dict(
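The hunk ends mid-snippet; in the full document the two evaluators are combined into a `CascadeEvaluator`. A minimal sketch of that combination, assuming only the keyword names visible in examples/eval_cascade_evaluator.py later in this commit (any further options are not shown here):

    # Sketch, not part of the diff: wire the rule-based evaluator and the
    # LLM judge evaluator defined above into a single cascade evaluator.
    cascade_evaluator = dict(
        type=CascadeEvaluator,
        rule_evaluator=rule_evaluator,
        llm_evaluator=llm_judge_evaluator,
    )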

docs/en/advanced_guides/math_verify.md

Lines changed: 7 additions & 7 deletions

@@ -2,7 +2,7 @@

 ## Introduction

-Mathematical reasoning is a crucial capability for large language models (LLMs). To evaluate a model's mathematical abilities, we need to test its capability to solve mathematical problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHEvaluator components.
+Mathematical reasoning is a crucial capability for large language models (LLMs). To evaluate a model's mathematical abilities, we need to test its capability to solve mathematical problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHVerifyEvaluator components.

 ## Dataset Format

@@ -61,7 +61,7 @@ math_infer_cfg = dict(

 ```python
 math_eval_cfg = dict(
-    evaluator=dict(type=MATHEvaluator),
+    evaluator=dict(type=MATHVerifyEvaluator),
 )
 ```

@@ -86,11 +86,11 @@ math_datasets = [
 ]
 ```

-## MATHEvaluator
+## MATHVerifyEvaluator

-The MATHEvaluator is specifically designed to evaluate mathematical answers. It is developed based on the math_verify library, which provides mathematical expression parsing and verification capabilities, supporting extraction and equivalence verification for both LaTeX and general expressions.
+The MATHVerifyEvaluator is specifically designed to evaluate mathematical answers. It is developed based on the math_verify library, which provides mathematical expression parsing and verification capabilities, supporting extraction and equivalence verification for both LaTeX and general expressions.

-The MATHEvaluator implements:
+The MATHVerifyEvaluator implements:

 1. Extracts answers from both predictions and references using LaTeX extraction
 2. Handles various LaTeX formats and environments
@@ -133,7 +133,7 @@ Here's a complete example of how to set up math evaluation:
 from mmengine.config import read_base
 from opencompass.models import TurboMindModelwithChatTemplate
 from opencompass.datasets import CustomDataset
-from opencompass.openicl.icl_evaluator.math_evaluator import MATHEvaluator
+from opencompass.openicl.icl_evaluator.math_evaluator import MATHVerifyEvaluator
 from opencompass.openicl.icl_prompt_template import PromptTemplate
 from opencompass.openicl.icl_retriever import ZeroRetriever
 from opencompass.openicl.icl_inferencer import GenInferencer
@@ -160,7 +160,7 @@ math_infer_cfg = dict(

 # Evaluation configuration
 math_eval_cfg = dict(
-    evaluator=dict(type=MATHEvaluator),
+    evaluator=dict(type=MATHVerifyEvaluator),
 )

 # Dataset configuration
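The dataset block that follows "# Dataset configuration" lies outside this hunk. A minimal sketch of what it typically looks like, with placeholder names, paths, and column names that are not taken from the commit:

    # Sketch only: a CustomDataset entry wired to the infer/eval configs above.
    math_datasets = [
        dict(
            type=CustomDataset,
            abbr='my-math-dataset',                  # placeholder name
            path='path/to/your/math_dataset.jsonl',  # placeholder path
            reader_cfg=dict(input_columns=['problem'], output_column='solution'),  # placeholder columns
            infer_cfg=math_infer_cfg,
            eval_cfg=math_eval_cfg,
        )
    ]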

docs/zh_cn/advanced_guides/llm_judge.md

Lines changed: 1 addition & 1 deletion

@@ -277,7 +277,7 @@ OpenCompass also provides the cascade evaluator `CascadeEvaluator`, which combines rule-based

 ```python
 # Define a rule-based evaluator
-rule_evaluator = dict(type=MATHEvaluator)
+rule_evaluator = dict(type=MATHVerifyEvaluator)

 # Define an LLM judge evaluator
 llm_judge_evaluator = dict(

docs/zh_cn/advanced_guides/math_verify.md

Lines changed: 7 additions & 7 deletions

@@ -2,7 +2,7 @@

 ## Introduction

-Mathematical reasoning is a key capability of large language models (LLMs). To evaluate a model's mathematical ability, we need to test whether it can solve math problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHEvaluator components.
+Mathematical reasoning is a key capability of large language models (LLMs). To evaluate a model's mathematical ability, we need to test whether it can solve math problems step by step and provide accurate final answers. OpenCompass provides a convenient way to evaluate mathematical reasoning through the CustomDataset and MATHVerifyEvaluator components.

 ## Dataset Format

@@ -61,7 +61,7 @@ math_infer_cfg = dict(

 ```python
 math_eval_cfg = dict(
-    evaluator=dict(type=MATHEvaluator),
+    evaluator=dict(type=MATHVerifyEvaluator),
 )
 ```

@@ -86,11 +86,11 @@ math_datasets = [
 ]
 ```

-## MATHEvaluator
+## MATHVerifyEvaluator

-MATHEvaluator is an evaluator designed specifically for checking mathematical answers. It is built on the math_verify library, which provides mathematical expression parsing and verification, supporting extraction and equivalence checking for both LaTeX and general expressions.
+MATHVerifyEvaluator is an evaluator designed specifically for checking mathematical answers. It is built on the math_verify library, which provides mathematical expression parsing and verification, supporting extraction and equivalence checking for both LaTeX and general expressions.

-MATHEvaluator implements:
+MATHVerifyEvaluator implements:

 1. Extracts answers from both predictions and references using LaTeX extraction
 2. Handles various LaTeX formats and environments
@@ -133,7 +133,7 @@ MATHEvaluator implements:
 from mmengine.config import read_base
 from opencompass.models import TurboMindModelwithChatTemplate
 from opencompass.datasets import CustomDataset
-from opencompass.openicl.icl_evaluator.math_evaluator import MATHEvaluator
+from opencompass.evaluator import MATHVerifyEvaluator
 from opencompass.openicl.icl_prompt_template import PromptTemplate
 from opencompass.openicl.icl_retriever import ZeroRetriever
 from opencompass.openicl.icl_inferencer import GenInferencer
@@ -160,7 +160,7 @@ math_infer_cfg = dict(

 # Evaluation configuration
 math_eval_cfg = dict(
-    evaluator=dict(type=MATHEvaluator),
+    evaluator=dict(type=MATHVerifyEvaluator),
 )

 # Dataset configuration

examples/eval_cascade_evaluator.py

Lines changed: 6 additions & 3 deletions

@@ -7,9 +7,12 @@
 from opencompass.openicl.icl_prompt_template import PromptTemplate
 from opencompass.openicl.icl_retriever import ZeroRetriever
 from opencompass.openicl.icl_inferencer import GenInferencer
-from opencompass.evaluator import GenericLLMEvaluator, CascadeEvaluator
+from opencompass.evaluator import (
+    GenericLLMEvaluator,
+    CascadeEvaluator,
+    MATHVerifyEvaluator,
+)
 from opencompass.datasets import generic_llmjudge_postprocess
-from opencompass.openicl.icl_evaluator import MATHEvaluator
 from opencompass.datasets import (
     MATHDataset,
     math_postprocess_v2,
@@ -94,7 +97,7 @@
     judge_cfg=dict(),
 )

-rule_evaluator =dict(type=MATHEvaluator)
+rule_evaluator =dict(type=MATHVerifyEvaluator)
 cascade_evaluator = dict(type=CascadeEvaluator,
                          llm_evaluator=llm_judge_evaluator,
                          rule_evaluator=rule_evaluator,
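The hunk stops inside the `cascade_evaluator` dict, so its remaining options are not shown here. As a rough sketch (assumed, not taken from the diff), the evaluator would then be referenced from the dataset's eval config:

    # Sketch only: attach the cascade evaluator to the dataset's eval_cfg.
    math_eval_cfg = dict(
        evaluator=cascade_evaluator,
    )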

examples/eval_qwen3.py

Lines changed: 142 additions & 0 deletions

@@ -0,0 +1,142 @@ (new file)

import os.path as osp
from opencompass.models import OpenAISDK
from mmengine.config import read_base
from opencompass.utils.text_postprocessors import extract_non_reasoning_content
from opencompass.runners import LocalRunner
from opencompass.partitioners import NaivePartitioner, NumWorkerPartitioner
from opencompass.tasks import OpenICLInferTask, OpenICLEvalTask

with read_base():
    from opencompass.configs.datasets.aime2024.aime2024_cascade_eval_gen_5e9f4f import aime2024_datasets
    from opencompass.configs.datasets.aime2025.aime2025_cascade_eval_gen_5e9f4f import aime2025_datasets
    from opencompass.configs.datasets.math.math_500_cascade_eval_gen_6ff468 import math_datasets

#######################################################################
#                         PART 0  Meta Info                           #
#######################################################################

api_meta_template = dict(round=[
    dict(role='HUMAN', api_role='HUMAN'),
    dict(role='BOT', api_role='BOT', generate=True),
],
)

judge_cfg = dict(
    abbr='qwen2-5-32B-Instruct',
    type=OpenAISDK,
    path='Qwen/Qwen2.5-32B-Instruct',
    key='sk-1234',
    openai_api_base=[
        'http://x.x.x.x:4000/v1',
    ],
    meta_template=api_meta_template,
    query_per_second=8,
    batch_size=256,
    temperature=0.001,
    # max_completion_tokens=32768,
    tokenizer_path='gpt-4o-2024-05-13',
    # verbose=True,
    max_out_len=16384,
    max_seq_len=32768,
    # max_seq_len=49152,
    mode='mid',
    retry=10
)

#######################################################################
#                       PART 1  Datasets List                         #
#######################################################################

repeated_info = [
    (math_datasets, 4),
    (aime2024_datasets, 32),
    (aime2025_datasets, 32),
]

for datasets_, num in repeated_info:
    for dataset_ in datasets_:
        dataset_['n'] = num

datasets = sum(
    (v for k, v in locals().items() if k.endswith('_datasets')),
    [],
)

for item in datasets:
    item['infer_cfg']['inferencer']['max_out_len'] = 32768
    try:
        if 'judge_cfg' in item['eval_cfg']['evaluator']:
            item['eval_cfg']['evaluator']['judge_cfg'] = judge_cfg
        elif 'judge_cfg' in item['eval_cfg']['evaluator']['llm_evaluator']:
            item['eval_cfg']['evaluator']['llm_evaluator']['judge_cfg'] = judge_cfg
    except:
        pass

#######################################################################
#                     PART 2  Dataset Summarizer                      #
#######################################################################

summarizer = dict(
    dataset_abbrs=[
        'MATH',
        ['math_prm800k_500', 'accuracy (4 runs average)'],
        ['aime2024', 'accuracy (32 runs average)'],
        ['aime2025', 'accuracy (32 runs average)'],
        ['livemathbench_hard', 'naive_average'],
        ['OlympiadBenchMath', 'accuracy'],
        ['olymmath', 'naive_average'],
    ],
    summary_groups = sum(
        [v for k, v in locals().items() if k.endswith('_summary_groups')], []
    ),
)

#######################################################################
#                        PART 3  Models List                          #
#######################################################################

models = sum([v for k, v in locals().items() if k.endswith('_model')], [])
models += [

    dict(
        abbr='Qwen_Qwen3-235B-A22B',
        type=OpenAISDK,
        path='Qwen/Qwen3-235B-A22B',
        key='sk-admin',
        openai_api_base=[
            'http://106.15.231.215:40007/v1/',
        ],
        meta_template=dict(
            # begin=dict(role='SYSTEM', api_role='SYSTEM', prompt=''),
            round=[
                dict(role='HUMAN', api_role='HUMAN'),
                # XXX: all system roles are mapped to human in purpose
                dict(role='BOT', api_role='BOT', generate=True),
            ]
        ),
        query_per_second=16,
        batch_size=128,
        # batch_size=1,
        temperature=0.6,
        # max_completion_tokens=32768,
        tokenizer_path='gpt-4',
        # verbose=True,
        max_out_len=32768,
        max_seq_len=32768,
        pred_postprocessor=dict(type=extract_non_reasoning_content)
    ),
]

infer = dict(
    partitioner=dict(type=NumWorkerPartitioner, num_worker=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLInferTask)),
)

eval = dict(
    partitioner=dict(type=NaivePartitioner, n=8),
    runner=dict(type=LocalRunner, task=dict(type=OpenICLEvalTask)),
)

base_exp_dir = 'outputs/qwen3_reasoning'
work_dir = osp.join(base_exp_dir, 'chat_objective')
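This example is where the "Data Replica" part of the commit title shows up: setting `dataset_['n']` replicates each dataset so every problem is evaluated multiple times, and the summarizer pairs each dataset with the corresponding averaged metric, for example `aime2024` with `'accuracy (32 runs average)'` for `n = 32`. The new `--dataset-num-runs` CLI flag added in opencompass/cli/main.py below presumably exposes the same control from the command line, though the wiring is not visible in this diff.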

opencompass/cli/main.py

Lines changed: 19 additions & 4 deletions

@@ -12,8 +12,8 @@
 from opencompass.registry import PARTITIONERS, RUNNERS, build_from_cfg
 from opencompass.runners import SlurmRunner
 from opencompass.summarizers import DefaultSummarizer
-from opencompass.utils import (LarkReporter, get_logger, read_from_station,
-                               save_to_station)
+from opencompass.utils import (LarkReporter, get_logger, pretty_print_config,
+                               read_from_station, save_to_station)
 from opencompass.utils.run import (fill_eval_cfg, fill_infer_cfg,
                                    get_config_from_arg)

@@ -94,6 +94,11 @@ def parse_args():
                        help='Use the custom config directory instead of config/ to '
                        'search the configs for datasets, models and summarizers',
                        type=str)
+    parser.add_argument(
+        '--config-verbose',
+        default=False,
+        action='store_true',
+        help='Whether to print the config in verbose mode.')
     parser.add_argument('-l',
                         '--lark',
                         help='Report the running status to lark bot',
@@ -131,7 +136,7 @@ def parse_args():
                         'correctness of each sample, bpb, etc.',
                         action='store_true',
                         )
-
+    # for the results persistence
     parser.add_argument('-sp',
                         '--station-path',
                         help='Path to your results station.',
@@ -150,7 +155,12 @@ def parse_args():
                         'data station.',
                         action='store_true',
                         )
-
+    # for evaluation with multiple runs
+    parser.add_argument('--dataset-num-runs',
+                        help='How many runs for one dataset',
+                        type=int,
+                        default=1,
+                        )

     # set srun args
     slurm_parser = parser.add_argument_group('slurm_args')
@@ -299,6 +309,11 @@ def main():
        content = f'{getpass.getuser()}\'s task has been launched!'
        LarkReporter(cfg['lark_bot_url']).post(content)

+
+    # print config if specified --config-verbose
+    if args.config_verbose:
+        pretty_print_config(cfg)
+
    # infer
    if args.mode in ['all', 'infer']:
        # When user have specified --slurm or --dlc, or have not set
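Taken together, the new flags could be exercised roughly like this (an illustrative command only; the dataset and model names reuse the README examples above, and the flags' exact behaviour is only as described by their help strings):

    opencompass --datasets aime2024_llmjudge_gen --models hf_internlm2_5_1_8b_chat --dataset-num-runs 32 --config-verbose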
