OpenR1-Qwen-7B achieves 47.40 on AIME24, better than reported! #622
@NathanHB Do you have any idea? Any comment would be helpful.
I get a similar result (math_pass@1:32_samples = 0.482 on AIME24 using the downloaded OpenR1-Qwen-7B weights).
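For reference, as I understand lighteval's naming, math_pass@1:32_samples is pass@1 estimated from 32 generations per problem: the fraction of correct samples per problem, averaged over problems. A minimal sketch with hypothetical toy data (not lighteval's actual code):

```python
def pass_at_1(correct_flags):
    """pass@1 from n samples: fraction of the samples that are correct."""
    return sum(correct_flags) / len(correct_flags)

# Hypothetical per-problem correctness of 32 sampled answers each.
problems = [
    [True] * 16 + [False] * 16,   # 16/32 correct -> 0.5
    [True] * 15 + [False] * 17,   # 15/32 correct -> 0.46875
]

# Average the per-problem pass@1 scores over the benchmark.
score = sum(pass_at_1(p) for p in problems) / len(problems)
print(round(score, 3))  # 0.484
```

On this reading, a score of 0.482 corresponds to 48.2 on the usual 0–100 scale.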
@Hasuer how did you compute math_pass@1:32_samples?
I think the latest version of …
I cannot run the original evaluation code with …
Maybe the old version of …
But I just cloned the repo two days ago, and used … What modifications did you make to run the evaluation code? Can you run the evaluation code with …?
Thanks for your instructions. And I’m really wondering how the reported score of OpenR1-Qwen-7B on AIME24 could be 36.7, even when I calculate the …
As far as I know, the …
Exactly, so it seems that Hugging Face underestimated the performance of their model (at least on AIME24).
I guess with the update of …
Also, according to the comment in lighteval's related code:
@Hasuer @StarLooo Thanks for the interesting discussions. Just to recap, the following computes …
How can one compute …? Would it simply be …? Also note that in the new version (or probably the old one as well), something like … I rely on this package for comprehensive evaluations, but it is really slow.
> How can one compute pass@k, which samples k times, with the newest version of lighteval?

You can make the following modifications:
For other benchmarks, you can use the same steps. |
You may have confused pass@k with the sampling number n. You can see the detailed computation process in https://github.com/huggingface/lighteval/blob/main/src/lighteval/metrics/metrics_sample.py#L1118.
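The distinction matters because pass@k is usually computed with the unbiased estimator from the Codex paper: draw n samples, count c correct ones, and estimate the probability that a random subset of k samples contains at least one correct answer, i.e. 1 - C(n-c, k)/C(n, k). A sketch of that standard estimator (not lighteval's exact code; see the link above for that):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k answers,
    drawn without replacement from n samples of which c are correct,
    is correct. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a
        # correct answer, so the probability is exactly 1.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=32, c=16, k=1))  # 0.5 (reduces to c/n when k=1)
```

Note that for k = 1 this reduces to c/n, which is why pass@1 over 32 samples is just the mean correctness rate; k is independent of the sampling number n.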
Hi! Thanks for your interest in this! The data parallel issue is caused by the latest vLLM version; use an earlier version to make it work, as we do not have a fix for now :)
FYI, data parallel is now working on the latest version of …
Hi, I copied the newest Makefile and reran …, but it shows the error "RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method". If I use …, it hangs here: …
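For context on this error: a CUDA context created in a parent process cannot survive a fork(), which is why the traceback demands the 'spawn' start method (a freshly spawned interpreter has no inherited CUDA state). A minimal, vLLM-free sketch of spawn-based multiprocessing, with a hypothetical worker and no CUDA involved:

```python
import multiprocessing as mp

def worker(x):
    # In real use, CUDA would be initialized here; spawned children
    # start from a clean interpreter, so nothing clashes.
    return x * x

if __name__ == "__main__":
    # Request 'spawn' explicitly instead of the platform default
    # (which is 'fork' on Linux).
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        print(pool.map(worker, [1, 2, 3]))  # prints [1, 4, 9]
```

In the vLLM case the subprocesses are created inside the library, so the fix is configuration (see the env var suggested below in this thread) rather than changing your own pool code.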
Hi @Hasuer, I believe one needs to pass the following env var: …
Does this solve the issue?
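The variable itself was elided from this thread, so treat the following as an assumption: the usual fix for this vLLM error is its worker-multiprocessing switch, which must be set before vllm is imported. A sketch:

```python
import os

# Assumption: VLLM_WORKER_MULTIPROC_METHOD is the env var meant here.
# It tells vLLM to start worker subprocesses with 'spawn' instead of
# 'fork', avoiding the "Cannot re-initialize CUDA" RuntimeError.
# It must be set before the first `import vllm`.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
```

Equivalently, it can be exported in the shell before launching the evaluation command.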
Hi, I believe …
If setting …
The reported OpenR1-Qwen-7B result on AIME24 is 36.7.
But when I downloaded the model from Hugging Face and used lighteval to evaluate it, I got the result below, which is much higher than reported!
The evaluation code:
I tried to use data_parallel_size, but encountered this issue.
Besides, the versions I use are vllm 0.8.3, ray 2.43.0, and lighteval 0.8.1.dev0. Has anyone ever faced this situation? Thanks in advance.
@lewtun Do you have any idea? Any comment would be helpful.