
[Evals] Add results on non-reasoning tasks #26


Merged
18 commits merged into NovaSky-AI:main on Jan 24, 2025

Conversation

@SumanthRH (Collaborator) commented on Jan 17, 2025:

What does this PR do?

More evaluation results from @erictang000 and me. We evaluated the OSS models on instruction-following and QA benchmarks like IFEval, MMLU, etc.

| Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ-32B-Preview | Eval Implementation |
|---|---|---|---|---|
| MMLU (0 shot; no CoT) | 78.36 | 74.14 | 71.23 | lm_eval |
| MMLU (5 shot; no CoT) | 82.46 | 82.62 | 82.32 | lm_eval |
| ARC-C (0 shot; no CoT) | 49.49 | 49.4 | 46.25 | lm_eval |
| IFEval | 74.68 | 79.3 | 34.75 | lm_eval |
| LLM-as-a-Judge | 9.12 | 9.19 | 8.36 | fastchat |
| MGSM (0 shot; direct) | 33 | 42.3 | 15.5 | lm_eval |
| MGSM (8-shot; direct) | 58.4 | 61.47 | 59.97 | lm_eval |
| BFCL-v3 | 53.18 | 58.92 | 17.41 | BFCL |

Note: We've included ARC-C here as well, separate from the CoT-based ARC-C results implemented in #21. We've tried to use the standard no-CoT setting used for evaluating base and instruct models: assistant response prefill ("<assistant_header> The best answer is") + single-token decode (a sketch of this setup is shown below).
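
For illustration, here is a minimal sketch of that prefill + single-token-decode setup using Hugging Face transformers. This is not the lm_eval implementation itself; the question and answer letters are placeholders, and the exact prompt formatting is an assumption.

```python
# Hedged sketch of the no-CoT multiple-choice setup described above:
# chat-template the question, prefill the assistant turn with
# "The best answer is", then score a single next token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NovaSky-AI/Sky-T1-32B-Preview"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

question = (
    "Which gas do plants absorb during photosynthesis?\n"
    "A. Oxygen\nB. Carbon dioxide\nC. Nitrogen\nD. Helium"
)
messages = [{"role": "user", "content": question}]

# Build the prompt up to the assistant header, then prefill the response.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += " The best answer is"

inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

# Single-token decode: compare the logits of the answer-letter tokens.
choices = ["A", "B", "C", "D"]
choice_ids = [tok.encode(" " + c, add_special_tokens=False)[0] for c in choices]
print(choices[int(torch.argmax(next_token_logits[choice_ids]))])
```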

@SumanthRH marked this pull request as ready for review January 17, 2025 22:33
@SumanthRH changed the title from "[Evals] Add results on instruction-following tasks" to "[Evals] Add results on non-reasoning tasks" on Jan 17, 2025
@caoshiyi (Member) commented:

Hi @SumanthRH, can we also add an eval on Arena-Hard?

@erictang000 (Collaborator) commented:

@caoshiyi working on it now!

@DachengLi1 (Collaborator) commented:

This is amazing @SumanthRH! Could you also include columns of results for Qwen2.5-32B-Instruct and QwQ-32B-Preview run on lm_eval, to make sure the current implementation is correct? Thank you!

@erictang000 (Collaborator) commented on Jan 18, 2025:

@caoshiyi Ran Arena-Hard evals for NovaSky-AI/Sky-T1-32B-Preview, Qwen/Qwen2.5-32B-Instruct, and Qwen/QwQ-32B-Preview. Scores are generated against the 72 other models currently found at https://huggingface.co/spaces/lmsys/arena-hard-browser. Here are the current results (plus the top scorer, o1-mini):

| model | score | rating_q025 | rating_q975 | CI | avg_tokens | date |
|---|---|---|---|---|---|---|
| o1-mini-2024-09-12 | 91.98 | 90.88 | 93.12 | (-1.10, +1.14) | 1399.0 | 2025-01-18 |
| sky-T1-32B-Preview | 74.79 | 72.28 | 76.8 | (-2.51, +2.01) | 847.0 | 2025-01-18 |
| qwen2.5-32b-instruct | 66.51 | 64.55 | 68.4 | (-1.96, +1.89) | 611.0 | 2025-01-18 |
| qwq-32b-preview | 52.6 | 50.86 | 54.91 | (-1.74, +2.31) | 1005.0 | 2025-01-23 |

Full results can be found here:
https://github.com/erictang000/arena-hard-auto/blob/skynova_evals/leaderboard/arena_hard_leaderboard_20250123.csv

Cached answers + judge ratings for regenerating the leaderboard can be found here:
https://github.com/erictang000/arena-hard-auto/tree/skynova_evals/data/arena-hard-v0.1

@SumanthRH (Collaborator, Author) commented on Jan 18, 2025:

@DachengLi1 We've actually computed results for QwQ and Qwen2.5-32B-Instruct! They're in the results table in the README. For reproduction with lm_eval, I only gave the command for Sky-T1; for the other models, it's just a matter of substituting the model ID (see the sketch below). Hope that makes sense!
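
For concreteness, a hedged sketch of what that substitution could look like via lm_eval's Python API (the README documents the exact CLI command; this loop is just an illustration and assumes the vllm/hf backends and the mmlu task name available in lm-evaluation-harness):

```python
# Hypothetical illustration: run the same lm_eval task for each model by
# swapping only the model ID. Not the PR's exact invocation.
import lm_eval

model_ids = [
    "NovaSky-AI/Sky-T1-32B-Preview",
    "Qwen/Qwen2.5-32B-Instruct",
    "Qwen/QwQ-32B-Preview",
]

for model_id in model_ids:
    results = lm_eval.simple_evaluate(
        model="vllm",  # or "hf" if vLLM is not installed
        model_args=f"pretrained={model_id},dtype=auto",
        tasks=["mmlu"],
        num_fewshot=0,
    )
    print(model_id, results["results"]["mmlu"])
```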

@caoshiyi (Member) left a comment:

The results look great! Thanks for the efforts @erictang000 @SumanthRH! Also, I'm wondering how much effort would be needed if we also want a unified script to run all of these evaluations, since currently users still need to set things up differently for different benchmarks (e.g., lm_eval and BFCL).

@lynnliu030 (Member) left a comment:

Another thing as I'm trying out this eval suite: IFEval doesn't seem to print any final accuracy output when run according to the instructions? The other tasks on lm_eval work OK. Not sure if @SumanthRH @erictang000 have encountered this problem before.

@lynnliu030 (Member) commented:

Also, another thing I've noticed: we need to provide the same system instruction for our model as in the reasoning-task evaluation (https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/tools/util/model_utils.py#L32), so I believe an additional parameter for setting this needs to be passed to lm_eval (a sketch of what that could look like is below).

I ran this again and got similar scores though, only slightly lower on MMLU. Maybe you can rerun and verify it again? @SumanthRH @erictang000 Thanks for the great help!
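
A hedged sketch of what passing that system prompt could look like, assuming lm_eval exposes `system_instruction` and `apply_chat_template` options (this is not the command used in the PR, and SYSTEM_PROMPT is a stand-in for the prompt defined in model_utils.py):

```python
# Hypothetical illustration, not the PR's actual setup: supply the same system
# instruction used in the reasoning-task evaluation when running lm_eval.
import lm_eval

SYSTEM_PROMPT = "..."  # copy from skythought/tools/util/model_utils.py

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=NovaSky-AI/Sky-T1-32B-Preview,tensor_parallel_size=8",
    tasks=["mmlu"],
    apply_chat_template=True,
    system_instruction=SYSTEM_PROMPT,
)
print(results["results"]["mmlu"])
```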

@SumanthRH (Collaborator, Author) commented on Jan 21, 2025:

> Another thing as I'm trying out this eval suite: IFEval doesn't seem to print any final accuracy output when run according to the instructions? The other tasks on lm_eval work OK. Not sure if @SumanthRH @erictang000 have encountered this problem before.

Thanks for pointing this out @lynnliu030! I missed an instruction for lm_eval: we need to install the dependencies for the IFEval task with the ifeval extra:

pip install -e .[ifeval]

I updated the installation instructions.

Also, I noticed that I had used tensor_parallel_size=4 for IFEval while the rest use 8. To be consistent, I changed this value to 8 and re-ran the IFEval evaluation. I noticed there is some variation (±1%) in the scores depending on the tensor_parallel_size used; I've added a disclaimer about this in base_instruct_evals.md.

@lynnliu030 (Member) commented:

@SumanthRH sounds good to me! Btw, I think the references to the instruction files in the main README.md also need to change.

@SumanthRH (Collaborator, Author) commented:

@lynnliu030 Done!

@SumanthRH (Collaborator, Author) commented on Jan 22, 2025:

I revisited the QwQ evaluation and found that the current results for QwQ are incorrect because of its chat template.

TLDR: QwQ's chat template defaults to a CoT-based system prompt when none is provided:

You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.

You can see the same here: https://huggingface.co/Qwen/QwQ-32B-Preview/blob/91906fe41a48b6a89ce2970abfd1269eefee170e/tokenizer_config.json#L197

What this means is that for tasks where a system message was not provided, the above CoT system prompt was added to all the samples. This can significantly change the results depending on the task.

1. lm_eval results: Some tasks are affected. For example, IFEval does not use a system prompt, but MMLU uses the dataset description as the system prompt.
2. LLM-as-a-Judge: Affected. No system prompt is used by default.
3. BFCL-v3: Not affected. All the models above (Sky-T1, Qwen2.5-32B-Instruct, QwQ) use the same QwenHandler data-processing class, which hardcodes Qwen2.5-32B-Instruct's chat template.

For the sake of consistency, I think it's best if we just use a different revision of QwQ for our evaluation (otherwise, we'd need special system-prompt handling per task). I've made a revision here (refs/pr/58) where the default prompt changes to

You are a helpful and harmless assistant. You are Qwen developed by Alibaba.

I'm going to re-evaluate QwQ now based on this. On IFEval, for example, the score improved from 34.75 to 42.51 after removing the CoT text. (The MMLU score was not affected, as expected.) See the sketch below for how the two revisions differ.
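
To make the difference concrete, here's a minimal sketch (assuming the transformers library; not part of the eval code) of how the default system prompt shows up in the two revisions:

```python
# Sketch: render the same user message with QwQ's original chat template and
# with the refs/pr/58 revision mentioned above, to see the default system
# prompt that gets injected when no system message is supplied.
from transformers import AutoTokenizer

messages = [{"role": "user", "content": "What is 2 + 2?"}]

tok_main = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview")
print(tok_main.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# Default system prompt includes "... You should think step-by-step."

tok_pr = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview", revision="refs/pr/58")
print(tok_pr.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# Default system prompt drops the step-by-step instruction.
```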

@lynnliu030 (Member) commented:

@SumanthRH are all the re-evals done & updated? We can merge this PR for now if that's done. Thanks!

@SumanthRH (Collaborator, Author) commented:

@lynnliu030 updating Arena hard right now, and then we should be good.

@SumanthRH requested a review from lynnliu030 January 23, 2025 19:50
@erictang000 (Collaborator) commented:

@lynnliu030 Updated arena-hard - should be all set with the re-evals!

@lynnliu030 (Member) left a comment:

@erictang000 @SumanthRH I think this can be merged now, thanks for the work!

@lynnliu030 merged commit 7e02a76 into NovaSky-AI:main Jan 24, 2025
StephenXie pushed a commit to StephenXie/SkyThought that referenced this pull request Jan 30, 2025