
OpenR1-Qwen-7B achieves 47.40 on AIME24, better than reported! #622


Open
Hasuer opened this issue Apr 24, 2025 · 22 comments

Comments


Hasuer commented Apr 24, 2025

The reported OpenR1-Qwen-7B result on AIME24 is 36.7.

However, when I download the model from Hugging Face and evaluate it with lighteval, I get the results below:

Task                Version  Metric                  Value   Stderr
all                          math_pass@1:32_samples  0.4740  ± 0.0651
                             extractive_match        0.4667  ± 0.0926
lighteval:aime24:0  1        math_pass@1:32_samples  0.4740  ± 0.0651
                             extractive_match        0.4667  ± 0.0926

This is much higher than reported!

The evaluation code:

MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,tensor_parallel_size=4,max_model_length=32768,max_num_batched_tokens=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"

lighteval vllm $MODEL_ARGS "lighteval|aime24|0|0" \
      --use-chat-template \
      --output-dir "$OUTPUT_DIR"

I tried to use data_parallel_size, but encountered this issue.

For reference, the versions I use are vllm 0.8.3, ray 2.43.0, and lighteval 0.8.1.dev0.

Has anyone ever faced this situation? Thanks in advance.

@lewtun Do you have any idea? Any comment can be helpful.


Hasuer commented Apr 24, 2025

  1. I don't set export VLLM_WORKER_MULTIPROC_METHOD=spawn in my evaluation script, although it is included in the tensor-parallel version of the evaluation command. I am not sure whether this has an effect on the evaluation result.
  2. I notice that lighteval applies a different prompt to different tasks. For AIME24, the prompt can be found here. I wonder whether the reported result was obtained with the same prompt or with other settings.

@NathanHB Do you have any idea? Any comment can be helpful.


StarLooo commented Apr 25, 2025

I get a similar result (math_pass@1:32_samples = 0.482 on AIME24 using the downloaded OpenR1-Qwen-7B weights).
However, I could not run the original evaluation code directly, so I made some modifications (see #602) and then successfully ran the lighteval evaluation.

@ahatamiz

@Hasuer how did you compute math_pass@1:32_samples? lighteval|aime24|0|0 does not seem to give you this.

@StarLooo

@Hasuer how did you compute math_pass@1:32_samples? lighteval|aime24|0|0 does not seem to give you this.

I think the latest version of lighteval has already integrated the aime24 task into its officially supported tasks (see: https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/default_tasks.py#L315), which supports the computation of the pass@1 metric.
However, the old version of this open-r1 repository implemented the evaluation of aime24 with its own additions, since the old version of lighteval didn't support aime24 directly.


Hasuer commented Apr 25, 2025

I get a similar result (math_pass@1:32_samples = 0.482 on AIME24 using the downloaded OpenR1-Qwen-7B weights). However, I could not run the original evaluation code directly, so I made some modifications (see #602) and then successfully ran the lighteval evaluation.

I cannot run the original evaluation code with data parallel either; using tensor_parallel_size is OK, and I did not make any other modifications.

@StarLooo

I get a similar result (math_pass@1:32_samples = 0.482 on AIME24 using the downloaded OpenR1-Qwen-7B weights). However, I could not run the original evaluation code directly, so I made some modifications (see #602) and then successfully ran the lighteval evaluation.

I cannot run the original evaluation code with data parallel either; using tensor_parallel_size is OK, and I did not make any other modifications.

Maybe the old version of open-r1 together with the old version of lighteval could run the evaluation code without modification.
The data parallel problem is also reported here: huggingface/lighteval#670


Hasuer commented Apr 25, 2025

I get a similar result (math_pass@1:32_samples = 0.482 on AIME24 using the downloaded OpenR1-Qwen-7B weights). However, I could not run the original evaluation code directly, so I made some modifications (see #602) and then successfully ran the lighteval evaluation.

I cannot run the original evaluation code with data parallel either; using tensor_parallel_size is OK, and I did not make any other modifications.

Maybe the old version of open-r1 together with the old version of lighteval could run the evaluation code without modification. The data parallel problem is also reported here: huggingface/lighteval#670

But I just cloned the repo two days ago and used make install to create the uv environment. The versions I use are vllm 0.8.3, ray 2.43.0, and lighteval 0.8.1.dev0.

What modifications did you make to run the evaluation code? Can you run the evaluation code with the data_parallel_size param in MODEL_ARGS?


StarLooo commented Apr 25, 2025

But I just cloned the repo two days ago and used make install to create the uv environment. The versions I use are vllm 0.8.3, ray 2.43.0, and lighteval 0.8.1.dev0.

What modifications did you make to run the evaluation code? Can you run the evaluation code with the data_parallel_size param in MODEL_ARGS?

  1. You can refer to the modifications I made to run the evaluation code here: Is vllm==0.8.3 causing some incompatible problems #602 (comment).
  2. I installed lighteval from source. When I use pip show lighteval to check its version, it shows 0.8.1.dev0.
  3. I met a similar problem when using lighteval vllm with data parallel, as in this issue: [BUG] vLLM backend hangs with DDP lighteval#670.
  4. I'm not very sure about the detailed influence of different versions of open-r1, lighteval, and vllm, especially since the open-r1 repository updates very frequently. All the issues I referenced above may contain useful information about how to run the evaluation.


Hasuer commented Apr 25, 2025

Thanks for your instructions. I'm really wondering how the reported score of OpenR1-Qwen-7B on AIME24 could be 36.7: even when I calculate math_pass@1:1_samples, the result still reaches 40+.


StarLooo commented Apr 25, 2025

Thanks for your instructions. I'm really wondering how the reported score of OpenR1-Qwen-7B on AIME24 could be 36.7: even when I calculate math_pass@1:1_samples, the result still reaches 40+.

As far as I know, the exact_match metric on AIME24 has a large variance, especially for small models; different runs may produce very different scores. Since AIME24 only contains 30 questions, math_pass@1:32_samples, which samples 32 generations per question, could be a better metric for monitoring model performance.
A recent reproduced evaluation also shows better performance than the values reported by open-r1:
#545
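
To illustrate the variance point, here is a rough simulation (this is not from the open-r1 or lighteval code; the 47% per-question solve rate is a made-up number, used only to show how run-to-run noise shrinks when you average 32 samples per question):

import numpy as np

# Toy model: each of the 30 AIME24 questions is solved with probability 0.47
# (a hypothetical value chosen only for this illustration).
rng = np.random.default_rng(0)
p, n_questions, n_runs = 0.47, 30, 10_000

# Score when each question gets a single sample per run (one exact-match style run).
acc_1 = rng.binomial(1, p, size=(n_runs, n_questions)).mean(axis=1)

# Score when each question gets 32 samples and its pass@1 is the fraction of
# correct samples (the math_pass@1:32_samples style of scoring).
acc_32 = (rng.binomial(32, p, size=(n_runs, n_questions)) / 32).mean(axis=1)

print(f"1 sample/question:   mean={acc_1.mean():.3f}, run-to-run std={acc_1.std():.3f}")
print(f"32 samples/question: mean={acc_32.mean():.3f}, run-to-run std={acc_32.std():.3f}")

Under this toy model the single-sample score swings by roughly ±9 points between runs (close to the ± 0.0926 stderr for extractive_match in the table above), while the 32-sample average moves by only one or two points.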


Hasuer commented Apr 25, 2025

Exactly, so it seems that Hugging Face underestimated the performance of their model (at least on AIME24).

@StarLooo

Exactly, so it seems that Hugging Face underestimated the performance of their model (at least on AIME24).

I guess that with the updates to lighteval, the metric computation (extracting the boxed answer and comparing it with the ground truth) has improved, while the reported performance was computed with an old version of lighteval a few months ago.

@StarLooo

Also, according to the comment in lighteval's related code
(https://github.com/huggingface/lighteval/blob/main/src/lighteval/metrics/dynamic_metrics.py#L200),
there are known issues that are worth being aware of but are difficult to address.
Since the method of extracting the answer to a math question and comparing it with the ground truth is complicated, and different prompts also have an influence, fluctuations and differences across versions of a single evaluation framework, or across different evaluation frameworks, are understandable.
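
As a toy illustration of why this is tricky (this is not lighteval's actual implementation, just a deliberately naive sketch of the kind of pitfalls a real extractive-match scorer has to handle):

import re

def naive_boxed_answer(completion: str) -> str | None:
    # Grab the last \boxed{...} whose contents contain no nested braces.
    matches = re.findall(r"\\boxed\{([^{}]+)\}", completion)
    return matches[-1] if matches else None

# Mathematically equivalent answers that a plain string comparison rejects:
assert naive_boxed_answer(r"The answer is \boxed{1/2}.") == "1/2"
assert naive_boxed_answer(r"The answer is \boxed{1/2}.") != "0.5"

# Nested LaTeX defeats the naive regex entirely (no answer extracted at all):
assert naive_boxed_answer(r"The answer is \boxed{\frac{1}{2}}.") is None

Real scorers need robust extraction plus symbolic comparison, which is exactly where different versions and frameworks can diverge.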


ahatamiz commented Apr 25, 2025

@Hasuer @StarLooo Thanks for the interesting discussions. Just to recap, the following computes math_pass@1:1_samples which is basically pass@1.

MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,tensor_parallel_size=4,max_model_length=32768,max_num_batched_tokens=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"

lighteval vllm $MODEL_ARGS "lighteval|aime24|0|0" \
      --use-chat-template \
      --output-dir "$OUTPUT_DIR"

How can one compute pass@k, which samples k times, with the newest version of lighteval?

Would it simply be

Also note that in the new version (and probably the old one as well), something like lighteval|aime24|0|0 basically follows the format suite|task|few_shot|truncate_few_shots, so you can only control the number of few-shot examples, not the generations.

I rely on this package for comprehensive evaluations but it is really slow.


Hasuer commented Apr 25, 2025

How can one compute pass@k, which samples k times, with the newest version of lighteval?

You can make the following modifications:

  1. This line defines a math_pass_at_1_4n metric, which means k is 1 and 4 samples are generated per question. To calculate pass@2, you can define a math_pass_at_2_4n metric by passing k=2 when initializing the PassAtK class, which means k is 2 and 4 samples are generated per question.
  2. Add the metric you just defined here if you want to evaluate AIME24 with pass@2.

For other benchmarks, you can use the same steps.
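
As a rough worked example of what such a pass@2-with-4-samples metric would measure for a single question (this uses the standard unbiased pass@k estimator and checks it by brute force; it is only an illustration, not the actual lighteval wiring):

from itertools import combinations
from math import comb

# Hypothetical outcome for one question: 4 samples were generated, 1 was correct.
samples = [True, False, False, False]
n, k = len(samples), 2
c = sum(samples)

# Brute force: fraction of the size-k subsets of samples containing a correct answer.
brute_force = sum(any(subset) for subset in combinations(samples, k)) / comb(n, k)

# Closed form: pass@k = 1 - C(n - c, k) / C(n, k).
closed_form = 1 - comb(n - c, k) / comb(n, k)

print(brute_force, closed_form)  # both 0.5 for this question

The benchmark-level score is then, as I understand it, this per-question value averaged over the 30 questions.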


StarLooo commented Apr 27, 2025

@Hasuer @StarLooo Thanks for the interesting discussions. Just to recap, the following computes math_pass@1:1_samples which is basically pass@1.

MODEL_ARGS="pretrained=$MODEL,dtype=bfloat16,tensor_parallel_size=4,max_model_length=32768,max_num_batched_tokens=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"

lighteval vllm $MODEL_ARGS "lighteval|aime24|0|0" \
      --use-chat-template \
      --output-dir "$OUTPUT_DIR"

How can one compute pass@k, which samples k times, with the newest version of lighteval?

Would it simply be

Also note that in the new version (and probably the old one as well), something like lighteval|aime24|0|0 basically follows the format suite|task|few_shot|truncate_few_shots, so you can only control the number of few-shot examples, not the generations.

I rely on this package for comprehensive evaluations but it is really slow.

You may have confused pass@k with the sampling number n. You can see the detailed computation process (https://github.com/huggingface/lighteval/blob/main/src/lighteval/metrics/metrics_sample.py#L1118) in lighteval for a clearer understanding.
The latest lighteval uses k=1, n=32 by default to compute math_pass_at_1_32n.
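
A small sanity check of that distinction, assuming the linked code implements the usual unbiased estimator pass@k = 1 - C(n - c, k) / C(n, k) for c correct answers among n samples: with k=1 the per-question score reduces to c/n, so math_pass@1:32_samples is simply the fraction of the 32 generations that are correct, averaged over questions, not a best-of-32.

from math import comb

n = 32
for c in range(n + 1):
    # With k = 1, 1 - C(n - c, 1) / C(n, 1) collapses to the plain average c / n.
    assert 1 - comb(n - c, 1) / comb(n, 1) == c / n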
Also, I notice that you are using the command offered by open-r1 directly, but I could not run it directly and had to make some modifications (see: #602 (comment) & #602 (comment) & #602 (comment)).
Please make sure you have updated the versions of open-r1 as well as lighteval.

@NathanHB

Hi! Thanks for your interest in this!
@StarLooo you are right: the old results for aime24 were computed using an older version of lighteval, and we have improved a few things since then, hence the better results. Improvements were made both to the metric and to the vllm model directly. Also, generation_parameters have better defaults now.

For the data parallel issues, this is caused by the latest vllm version; use an earlier version to make it work, as we do not have a fix for now :)

Hasuer closed this as completed Apr 30, 2025

lewtun commented May 5, 2025

FYI data parallel is now working on the latest version of vllm so if you update your env with the current dependencies in setup.py then it should work for you: https://github.com/huggingface/open-r1?tab=readme-ov-file#evaluating-models

Hasuer reopened this May 9, 2025

Hasuer commented May 9, 2025

FYI data parallel is now working on the latest version of vllm so if you update your env with the current dependencies in setup.py then it should work for you: https://github.com/huggingface/open-r1?tab=readme-ov-file#evaluating-models

Hi, I copied the newest Makefile and reran make install. Then I used the following MODEL_ARGS to evaluate:

MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,tensor_parallel_size=8,max_model_length=32768,max_num_batched_tokens=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:16384,temperature:0.6,top_p:0.95}"

But it fails with the error "RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method".

If I use data_parallel_size instead, like this:

MODEL_ARGS="model_name=$MODEL,dtype=bfloat16,data_parallel_size=8,max_model_length=32768,max_num_batched_tokens=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:16384,temperature:0.6,top_p:0.95}"

It hangs here:

(run_inference_one_model pid=141403) INFO 05-09 12:31:27 [ray_utils.py:288] Ray is already initialized. Skipping Ray initialization.
(run_inference_one_model pid=141403) INFO 05-09 12:31:27 [ray_utils.py:335] No current placement group found. Creating a new placement group.


lewtun commented May 9, 2025

Hi @Hasuer I believe one needs to pass the following env var:

export VLLM_WORKER_MULTIPROC_METHOD=spawn

Does this solve the issue?


Nativu5 commented May 18, 2025

Hi @Hasuer I believe one needs to pass the following env var:

export VLLM_WORKER_MULTIPROC_METHOD=spawn

Does this solve the issue?

Hi, I believe VLLM_WORKER_MULTIPROC_METHOD=spawn does not help. I am still stuck at "No current placement group found. Creating a new placement group." with the latest lighteval (0.9.2), vllm (v0.8.5.post1), and ray (2.46.0).


ytw0415 commented May 23, 2025

If setting export VLLM_WORKER_MULTIPROC_METHOD=spawn doesn't solve the problem, which version of vLLM should I use?
