[Evals] Add results on non-reasoning tasks #26
Conversation
Hi @SumanthRH, can we also add eval on Arena-hard?
@caoshiyi working on it now!
Thank you. This is amazing @SumanthRH! Could you also include a column of results from qwen-32b-instruct and QwQ-32B-Preview run on lm-eval, to make sure the current implementation is correct? Thank you!
@caoshiyi Ran Arena-hard evals for
Full results can be found here: Cached answers + judge ratings for regenerating the leaderboard can be found here:
@DachengLi1 We've actually computed results for QwQ and Qwen 32B instruct! They're in the results table in the README. On
The results look great! Thanks for the efforts @erictang000 @SumanthRH! Also, I am wondering how much effort would be needed if we also want a unified script to run all of these evaluations, since currently users still have to set things up differently for different benchmarks (e.g., lm_eval and bfcl).
Another thing: as I'm trying out this eval suite, it seems like IFEval is not printing any final accuracy output when run according to the instructions? The other tasks on lm_eval work fine. Not sure if @SumanthRH @erictang000 have encountered this problem before.
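(Side note for anyone debugging this: one way to sidestep the console output entirely is to call the harness from Python and print the aggregated metrics yourself. This is only a rough sketch, assuming the `lm_eval` package is installed; the model path, batch size, and exact metric key names are placeholders and can differ across harness versions.)

```python
# Sketch: run IFEval via lm-evaluation-harness' Python API and print whatever
# aggregated metrics it reports, rather than relying on the CLI summary table.
# Assumptions: `lm_eval` (lm-evaluation-harness) is installed; the model path
# and batch size below are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-32B-Instruct,dtype=bfloat16",
    tasks=["ifeval"],
    batch_size=8,
)

# results["results"] maps task name -> dict of aggregated metrics.
for metric, value in results["results"]["ifeval"].items():
    print(f"{metric}: {value}")
```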
Also, another thing I've been noticing: we need to provide the same system instruction for our model as in the reasoning task evaluation. https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/tools/util/model_utils.py#L32 So an additional parameter for setting this would be needed. I ran this again and got similar scores, only slightly lower on MMLU. Maybe you guys can rerun and verify it again? @SumanthRH @erictang000 Thanks for the great help!
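(To make the point concrete, here's a minimal sketch of what passing that system instruction through would look like when building the chat prompt. `SYSTEM_PROMPT` is a placeholder for the actual text defined in `model_utils.py`, and the model name is just an example; this is not the repo's actual eval code.)

```python
# Sketch: prepend the same system instruction used in the reasoning-task evals
# when rendering prompts for these benchmarks. SYSTEM_PROMPT is a placeholder;
# the real text lives in skythought/tools/util/model_utils.py.
from transformers import AutoTokenizer

SYSTEM_PROMPT = "<system prompt from model_utils.py>"  # placeholder text

tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview")
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "Which of the following ...?"},
]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```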
Thanks for pointing this out @lynnliu030! I missed an instruction for
I updated the installation instructions. Also, I noticed that I had used
@SumanthRH sounds good to me! Btw I think the main README.md reference to the instruction files also needs to be updated.
@lynnliu030 Done!
I revisited the QwQ evaluation and found that the current results for QwQ are incorrect because of its chat template. TL;DR: QwQ's chat template defaults to a CoT-based system prompt when none is provided:
You can see the same here: https://huggingface.co/Qwen/QwQ-32B-Preview/blob/91906fe41a48b6a89ce2970abfd1269eefee170e/tokenizer_config.json#L197
What this means is that for tasks where a
For the sake of consistency, I think it's best if we just use a different revision of QwQ for our evaluation (otherwise, we'd need to have special system prompt handling based on the task). I've made a revision here (
I'm going to re-evaluate QwQ now based on this. On IFEval, for example, the score improved from 34.75 to 42.51 after removing the CoT text. (The MMLU score was not affected, as expected.)
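(If anyone wants to reproduce the check, here's a small sketch that renders QwQ's chat template with no system message so you can see the injected default prompt. The revision id in the commented line is a placeholder for the one linked above, not an actual value.)

```python
# Sketch: render QwQ's chat template without providing a system message and
# inspect the output; with the upstream tokenizer config, a default (CoT-style)
# system block shows up even though none was passed.
from transformers import AutoTokenizer

messages = [{"role": "user", "content": "What is 2 + 2?"}]

tok = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview")
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

# To evaluate against a tokenizer revision whose template does not inject the
# default system prompt, pin it explicitly (revision id is a placeholder):
# tok = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview", revision="<revision-id>")
```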
@SumanthRH are all the re-evals done & updated? We can merge this PR for now if that's done. Thanks!
@lynnliu030 updating Arena-hard right now, and then we should be good.
@lynnliu030 Updated Arena-hard - should be all set with the re-evals!
@erictang000 @SumanthRH I think this can be merged now, thanks for the work!
* init
* x
* x
* x
* x
* Update README.md
* Update README.md
* Update README.md
* arena-hard results
* make ifeval tp size consistent; move evals md file
* x
* x
* add arena hard to final table; fix link
* new results for QwQ
* Update base_instruct_evals.md
* Update base_instruct_evals.md
* Update README.md

Signed-off-by: SumanthRH <[email protected]>
Co-authored-by: Eric Tang <[email protected]>
What does this PR do?
More evaluation results from @erictang000 and me. We evaluated the OSS models on instruction-following and QA benchmarks like IFEval, MMLU, etc.
Note: We've included ARC-C here as well, separate from the ARC-C results with CoT implemented in #21. We've tried to use the standard no-CoT setting for evaluating base and instruct models: assistant response prefill ("<assistant_header> The best answer is") + single-token decode.
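To make the setting concrete, here is a rough sketch of the direct (no-CoT) multiple-choice setup described above: render the chat prompt, prefill the assistant turn with "The best answer is", decode a single token, and compare it to the gold letter. The model name, question, and choices below are placeholders, and this is an illustration of the general setting rather than the exact evaluation code in this PR.

```python
# Sketch of the direct (no-CoT) multiple-choice setting: prefill the assistant
# response with "The best answer is" and decode exactly one token, then compare
# it to the gold choice letter. Model name, question, and choices are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-32B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

question = "Which gas do plants primarily absorb for photosynthesis?"
choices = {"A": "Oxygen", "B": "Carbon dioxide", "C": "Nitrogen", "D": "Hydrogen"}
user_msg = question + "\n" + "\n".join(f"{k}. {v}" for k, v in choices.items())

# Render the chat prompt up to the assistant header, then prefill the answer stem.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": user_msg}],
    tokenize=False,
    add_generation_prompt=True,
) + " The best answer is"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=1, do_sample=False)
pred = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:]).strip()
print(pred)  # e.g. "B"; compare against the gold letter for accuracy.
```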