
OpenR1-Qwen-7B achieves 47.40 on AIME24, better than reported! #622


Open
Hasuer opened this issue Apr 24, 2025 · 22 comments

Comments


Hasuer commented Apr 24, 2025

The reported OpenR1-Qwen-7B result on AIME24 is 36.7.

However, when I download the model from Hugging Face and evaluate it with lighteval, I get the results below:

Task                Version  Metric                  Value   Stderr
all                          math_pass@1:32_samples  0.4740  ± 0.0651
                             extractive_match        0.4667  ± 0.0926
lighteval:aime24:0  1        math_pass@1:32_samples  0.4740  ± 0.0651
                             extractive_match        0.4667  ± 0.0926

This is much higher than reported!

The evaluation code:

MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,tensor_parallel_size=4,max_model_length=32768,max_num_batched_tokens=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"

lighteval vllm $MODEL_ARGS "lighteval|aime24|0|0" \
      --use-chat-template \
      --output-dir "$OUTPUT_DIR"

I tried to use data_parallel_size, but encountered this issue.

For reference, the versions I use are vllm 0.8.3, ray 2.43.0, and lighteval 0.8.1.dev0.

Has anyone ever faced this situation? Thanks in advance.

@lewtun Do you have any idea? Any comment can be helpful.


Hasuer commented Apr 24, 2025

  1. I don't set export VLLM_WORKER_MULTIPROC_METHOD=spawn in my evaluation script, although it is included in the tensor-parallel version of the evaluation command. I am not sure whether this has an effect on the evaluation result.
  2. I notice that lighteval applies a different prompt to different tasks. For AIME24, the prompt can be found here. I wonder whether the reported result was obtained with the same prompt or with other settings.

@NathanHB Do you have any idea? Any comment can be helpful.


StarLooo commented Apr 25, 2025

I get a similar result (math_pass@1:32_samples = 0.482 on AIME24 using the downloaded OpenR1-Qwen-7B weights).
However, I could not run the original evaluation code directly, so I made some modifications (see #602) and then successfully ran the lighteval evaluation.

@ahatamiz

@Hasuer how did you compute math_pass@1:32_samples? lighteval|aime24|0|0 does not seem to give you this.

@StarLooo

@Hasuer how did you compute math_pass@1:32_samples? lighteval|aime24|0|0 does not seem to give you this.

I think the latest version of lighteval has already integrated the aime24 task into its officially supported tasks (see: https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/default_tasks.py#L315), which supports the computation of the pass@1 metric.
However, the old version of this open-r1 repository implemented the evaluation of aime24 with its own additions, since the old version of lighteval didn't support aime24 directly.


Hasuer commented Apr 25, 2025

I get a similar result (math_pass@1:32_samples = 0.482 on AIME24 using the downloaded OpenR1-Qwen-7B weights). However, I could not run the original evaluation code directly, so I made some modifications (see #602) and then successfully ran the lighteval evaluation.

I cannot run the original evaluation code with data parallel either; using tensor_parallel_size is OK, and I did not make any other modifications.

@StarLooo

I get a similar result (math_pass@1:32_samples = 0.482 on AIME24 using the downloaded OpenR1-Qwen-7B weights). However, I could not run the original evaluation code directly, so I made some modifications (see #602) and then successfully ran the lighteval evaluation.

I cannot run the original evaluation code with data parallel either; using tensor_parallel_size is OK, and I did not make any other modifications.

Maybe the old version of open-r1 together with the old version of lighteval could run the evaluation code without modification.
The data parallel problem is also reported here: huggingface/lighteval#670


Hasuer commented Apr 25, 2025

I get a similar result (math_pass@1:32_samples = 0.482 on AIME24 using the downloaded OpenR1-Qwen-7B weights). However, I could not run the original evaluation code directly, so I made some modifications (see #602) and then successfully ran the lighteval evaluation.

I cannot run the original evaluation code with data parallel either; using tensor_parallel_size is OK, and I did not make any other modifications.

Maybe the old version of open-r1 together with the old version of lighteval could run the evaluation code without modification. The data parallel problem is also reported here: huggingface/lighteval#670

But I just cloned the repo two days ago and used make install to create the uv environment. The versions I use are vllm 0.8.3, ray 2.43.0, and lighteval 0.8.1.dev0.

What modifications did you make to run the evaluation code? Can you run the evaluation code with the data_parallel_size param in MODEL_ARGS?


StarLooo commented Apr 25, 2025

But I just cloned the repo two days ago and used make install to create the uv environment. The versions I use are vllm 0.8.3, ray 2.43.0, and lighteval 0.8.1.dev0.

What modifications did you make to run the evaluation code? Can you run the evaluation code with the data_parallel_size param in MODEL_ARGS?

  1. You can refer to the modifications I made to run the evaluation code here: Is vllm==0.8.3 causing some incompatible problems #602 (comment).
  2. I installed lighteval from source. When I use pip show lighteval to check its version, it shows 0.8.1.dev0.
  3. I met a similar problem when using lighteval vllm with data parallel, as in this issue: [BUG] vLLM backend hangs with DDP lighteval#670.
  4. I'm not very sure about the detailed influence of different versions of open-r1, lighteval, and vllm, especially since the open-r1 repository updates very frequently. All the issues I referenced above may contain useful information about how to run the evaluation.


Hasuer commented Apr 25, 2025

Thanks for your instructions. I'm really wondering how the reported score of OpenR1-Qwen-7B on AIME24 could be 36.7: even when I calculate math_pass@1:1_samples, the result still reaches 40+.


StarLooo commented Apr 25, 2025

Thanks for your instructions. I'm really wondering how the reported score of OpenR1-Qwen-7B on AIME24 could be 36.7: even when I calculate math_pass@1:1_samples, the result still reaches 40+.

As far as I know, the exact_match metric on AIME24 has a large variance, especially for small models; different runs may produce very different scores. Since AIME24 only contains 30 questions, math_pass@1:32_samples, which samples 32 generations per question, could be a better metric for monitoring model performance.
A recent reproduced evaluation also shows better performance than the values reported by open-r1:
#545
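
To illustrate the variance point, here is a rough simulation (this is not from the open-r1 or lighteval code; the 47% per-question solve rate is a made-up number, used only to show how run-to-run noise shrinks when you average 32 samples per question):

import numpy as np

# Toy model: each of the 30 AIME24 questions is solved with probability 0.47
# (a hypothetical value chosen only for this illustration).
rng = np.random.default_rng(0)
p, n_questions, n_runs = 0.47, 30, 10_000

# Score when each question gets a single sample per run (one exact-match style run).
acc_1 = rng.binomial(1, p, size=(n_runs, n_questions)).mean(axis=1)

# Score when each question gets 32 samples and its pass@1 is the fraction of
# correct samples (the math_pass@1:32_samples style of scoring).
acc_32 = (rng.binomial(32, p, size=(n_runs, n_questions)) / 32).mean(axis=1)

print(f"1 sample/question:   mean={acc_1.mean():.3f}, run-to-run std={acc_1.std():.3f}")
print(f"32 samples/question: mean={acc_32.mean():.3f}, run-to-run std={acc_32.std():.3f}")

Under this toy model the single-sample score swings by roughly ±9 points between runs (close to the ± 0.0926 stderr for extractive_match in the table above), while the 32-sample average moves by only one or two points.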


Hasuer commented Apr 25, 2025

Exactly, so it seems that Hugging Face underestimated the performance of their model (at least on AIME24).

@StarLooo

Exactly, so it seems that Hugging Face underestimated the performance of their model (at least on AIME24).

I guess that with the updates to lighteval, the metric computation (extracting the boxed answer and comparing it with the ground truth) has improved, while the reported performance was computed with an old version of lighteval a few months ago.

@StarLooo

Also, according to the comment in lighteval's related code
(https://github.com/huggingface/lighteval/blob/main/src/lighteval/metrics/dynamic_metrics.py#L200),
there are known issues that are worth being aware of but are difficult to address.
Since the method of extracting the answer to a math question and comparing it with the ground truth is complicated, and different prompts also have an influence, fluctuations and differences across versions of a single evaluation framework, or across different evaluation frameworks, are understandable.
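
As a toy illustration of why this is tricky (this is not lighteval's actual implementation, just a deliberately naive sketch of the kind of pitfalls a real extractive-match scorer has to handle):

import re

def naive_boxed_answer(completion: str) -> str | None:
    # Grab the last \boxed{...} whose contents contain no nested braces.
    matches = re.findall(r"\\boxed\{([^{}]+)\}", completion)
    return matches[-1] if matches else None

# Mathematically equivalent answers that a plain string comparison rejects:
assert naive_boxed_answer(r"The answer is \boxed{1/2}.") == "1/2"
assert naive_boxed_answer(r"The answer is \boxed{1/2}.") != "0.5"

# Nested LaTeX defeats the naive regex entirely (no answer extracted at all):
assert naive_boxed_answer(r"The answer is \boxed{\frac{1}{2}}.") is None

Real scorers need robust extraction plus symbolic comparison, which is exactly where different versions and frameworks can diverge.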


ahatamiz commented Apr 25, 2025

@Hasuer @StarLooo Thanks for the interesting discussions. Just to recap, the following computes math_pass@1:1_samples which is basically pass@1.

MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,tensor_parallel_size=4,max_model_length=32768,max_num_batched_tokens=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"

lighteval vllm $MODEL_ARGS "lighteval|aime24|0|0" \
      --use-chat-template \
      --output-dir "$OUTPUT_DIR"

How can one compute pass@k, which samples k times, with the newest version of lighteval?

Would it simply be

Also note that in the new version (and probably the old one as well), something like lighteval|aime24|0|0 basically follows the format suite|task|few_shot|truncate_few_shots, so you can only control the number of few-shot examples, not the generations.

I rely on this package for comprehensive evaluations but it is really slow.


Hasuer commented Apr 25, 2025

How can one compute pass@k, which samples k times, with the newest version of lighteval?

You can make the following modifications:

  1. This line defines a math_pass_at_1_4n metric, which means k is 1 and 4 samples are generated per question. To calculate pass@2, you can define a math_pass_at_2_4n metric by passing k=2 when initializing the PassAtK class, which means k is 2 and 4 samples are generated per question.
  2. Add the metric you just defined here if you want to evaluate AIME24 with pass@2.

For other benchmarks, you can use the same steps.
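
As a rough worked example of what such a pass@2-with-4-samples metric would measure for a single question (this uses the standard unbiased pass@k estimator and checks it by brute force; it is only an illustration, not the actual lighteval wiring):

from itertools import combinations
from math import comb

# Hypothetical outcome for one question: 4 samples were generated, 1 was correct.
samples = [True, False, False, False]
n, k = len(samples), 2
c = sum(samples)

# Brute force: fraction of the size-k subsets of samples containing a correct answer.
brute_force = sum(any(subset) for subset in combinations(samples, k)) / comb(n, k)

# Closed form: pass@k = 1 - C(n - c, k) / C(n, k).
closed_form = 1 - comb(n - c, k) / comb(n, k)

print(brute_force, closed_form)  # both 0.5 for this question

The benchmark-level score is then, as I understand it, this per-question value averaged over the 30 questions.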


StarLooo commented Apr 27, 2025

@Hasuer @StarLooo Thanks for the interesting discussions. Just to recap, the following computes math_pass@1:1_samples which is basically pass@1.

MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,tensor_parallel_size=4,max_model_length=32768,max_num_batched_tokens=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"

lighteval vllm $MODEL_ARGS "lighteval|aime24|0|0" \
      --use-chat-template \
      --output-dir "$OUTPUT_DIR"

How can one compute pass@k, which samples k times, with the newest version of lighteval?

Would it simply be

Also note that in the new version (and probably the old one as well), something like lighteval|aime24|0|0 basically follows the format suite|task|few_shot|truncate_few_shots, so you can only control the number of few-shot examples, not the generations.

I rely on this package for comprehensive evaluations but it is really slow.

You may have confused pass@k with the sampling number n. You can see the detailed computation process (https://github.com/huggingface/lighteval/blob/main/src/lighteval/metrics/metrics_sample.py#L1118) in lighteval for a clearer understanding.
The latest lighteval uses k=1, n=32 by default to compute math_pass_at_1_32n.
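
A small sanity check of that distinction, assuming the linked code implements the usual unbiased estimator pass@k = 1 - C(n - c, k) / C(n, k) for c correct answers among n samples: with k=1 the per-question score reduces to c/n, so math_pass@1:32_samples is simply the fraction of the 32 generations that are correct, averaged over questions, not a best-of-32.

from math import comb

n = 32
for c in range(n + 1):
    # With k = 1, 1 - C(n - c, 1) / C(n, 1) collapses to the plain average c / n.
    assert 1 - comb(n - c, 1) / comb(n, 1) == c / n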
Also, I notice that you are using the command offered by open-r1 directly, but I could not run it directly and had to make some modifications (see: #602 (comment) & #602 (comment) & #602 (comment)).
Please make sure you have updated the versions of open-r1 as well as lighteval.

@NathanHB

Hi! Thanks for your interest in this!
@StarLooo you are right: the old results for aime24 were computed using an older version of lighteval, and we have improved a few things since then, hence the better results. Improvements were made both to the metric and to the vllm model directly. Also, generation_parameters have better defaults now.

For the data parallel issues, this is caused by the latest vllm version; use an earlier version to make it work, as we do not have a fix for now :)

Hasuer closed this as completed Apr 30, 2025

lewtun commented May 5, 2025

FYI data parallel is now working on the latest version of vllm so if you update your env with the current dependencies in setup.py then it should work for you: https://github.com/huggingface/open-r1?tab=readme-ov-file#evaluating-models

Hasuer reopened this May 9, 2025

Hasuer commented May 9, 2025

FYI data parallel is now working on the latest version of vllm so if you update your env with the current dependencies in setup.py then it should work for you: https://github.com/huggingface/open-r1?tab=readme-ov-file#evaluating-models

Hi, I copied the newest Makefile and reran make install. Then I used the following MODEL_ARGS to evaluate:

MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,tensor_parallel_size=8,max_model_length=32768,max_num_batched_tokens=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:16384,temperature:0.6,top_p:0.95}"

But it fails with the error "RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method".

If I use data_parallel_size instead, like this:

MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,data_parallel_size=8,max_model_length=32768,max_num_batched_tokens=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:16384,temperature:0.6,top_p:0.95}"

It hangs here:

(run_inference_one_model pid=141403) INFO 05-09 12:31:27 [ray_utils.py:288] Ray is already initialized. Skipping Ray initialization.
(run_inference_one_model pid=141403) INFO 05-09 12:31:27 [ray_utils.py:335] No current placement group found. Creating a new placement group.


lewtun commented May 9, 2025

Hi @Hasuer I believe one needs to pass the following env var:

export VLLM_WORKER_MULTIPROC_METHOD=spawn

Does this solve the issue?


Nativu5 commented May 18, 2025

Hi @Hasuer I believe one needs to pass the following env var:

export VLLM_WORKER_MULTIPROC_METHOD=spawn

Does this solve the issue?

Hi, I believe VLLM_WORKER_MULTIPROC_METHOD=spawn does not help. I am still stuck at "No current placement group found. Creating a new placement group." with the latest lighteval (0.9.2), vllm (v0.8.5.post1), and ray (2.46.0).


ytw0415 commented May 23, 2025

If setting export VLLM_WORKER_MULTIPROC_METHOD=spawn doesn't solve the problem, which version of vLLM should I use?
