
[Evals] Add results on non-reasoning tasks #26


Merged
18 commits merged into NovaSky-AI:main on Jan 24, 2025

Conversation

@SumanthRH (Collaborator) commented on Jan 17, 2025:

What does this PR do?

More evaluation results from @erictang000 and me. We evaluated the OSS models on instruction-following and QA benchmarks like IFEval, MMLU, etc.

| Metric | Sky-T1-32B-Preview | Qwen-2.5-32B-Instruct | QwQ-32B-Preview | Eval Implementation |
|---|---|---|---|---|
| MMLU (0 shot; no CoT) | 78.36 | 74.14 | 71.23 | lm_eval |
| MMLU (5 shot; no CoT) | 82.46 | 82.62 | 82.32 | lm_eval |
| ARC-C (0 shot; no CoT) | 49.49 | 49.4 | 46.25 | lm_eval |
| IFEval | 74.68 | 79.3 | 34.75 | lm_eval |
| LLM-as-a-Judge | 9.12 | 9.19 | 8.36 | fastchat |
| MGSM (0 shot; direct) | 33 | 42.3 | 15.5 | lm_eval |
| MGSM (8-shot; direct) | 58.4 | 61.47 | 59.97 | lm_eval |
| BFCL-v3 | 53.18 | 58.92 | 17.41 | BFCL |

Note: We've included ARC-C here as well, separate from the CoT-based ARC-C results implemented in #21. We've tried to use the standard no-CoT setting used for evaluating base and instruct models: assistant response prefill ("<assistant_header> The best answer is") + single-token decode (a sketch of this setup is shown below).
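
For illustration, here is a minimal sketch of that prefill + single-token-decode setup using Hugging Face transformers. This is not the lm_eval implementation itself; the question and answer letters are placeholders, and the exact prompt formatting is an assumption.

```python
# Hedged sketch of the no-CoT multiple-choice setup described above:
# chat-template the question, prefill the assistant turn with
# "The best answer is", then score a single next token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NovaSky-AI/Sky-T1-32B-Preview"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

question = (
    "Which gas do plants absorb during photosynthesis?\n"
    "A. Oxygen\nB. Carbon dioxide\nC. Nitrogen\nD. Helium"
)
messages = [{"role": "user", "content": question}]

# Build the prompt up to the assistant header, then prefill the response.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += " The best answer is"

inputs = tok(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

# Single-token decode: compare the logits of the answer-letter tokens.
choices = ["A", "B", "C", "D"]
choice_ids = [tok.encode(" " + c, add_special_tokens=False)[0] for c in choices]
print(choices[int(torch.argmax(next_token_logits[choice_ids]))])
```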

@SumanthRH marked this pull request as ready for review January 17, 2025 22:33
@SumanthRH changed the title from "[Evals] Add results on instruction-following tasks" to "[Evals] Add results on non-reasoning tasks" on Jan 17, 2025
@caoshiyi (Member) commented:

Hi @SumanthRH, can we also add an eval on Arena-Hard?

@erictang000 (Collaborator) commented:

@caoshiyi working on it now!

@DachengLi1 (Collaborator) commented:

This is amazing @SumanthRH! Could you also include columns of results for Qwen2.5-32B-Instruct and QwQ-32B-Preview run on lm_eval, to make sure the current implementation is correct? Thank you!

@erictang000 (Collaborator) commented on Jan 18, 2025:

@caoshiyi Ran Arena-Hard evals for NovaSky-AI/Sky-T1-32B-Preview, Qwen/Qwen2.5-32B-Instruct, and Qwen/QwQ-32B-Preview. Scores are generated against the 72 other models currently found at https://huggingface.co/spaces/lmsys/arena-hard-browser. Here are the current results (plus the top scorer, o1-mini):

| model | score | rating_q025 | rating_q975 | CI | avg_tokens | date |
|---|---|---|---|---|---|---|
| o1-mini-2024-09-12 | 91.98 | 90.88 | 93.12 | (-1.10, +1.14) | 1399.0 | 2025-01-18 |
| sky-T1-32B-Preview | 74.79 | 72.28 | 76.8 | (-2.51, +2.01) | 847.0 | 2025-01-18 |
| qwen2.5-32b-instruct | 66.51 | 64.55 | 68.4 | (-1.96, +1.89) | 611.0 | 2025-01-18 |
| qwq-32b-preview | 52.6 | 50.86 | 54.91 | (-1.74, +2.31) | 1005.0 | 2025-01-23 |

Full results can be found here:
https://github.com/erictang000/arena-hard-auto/blob/skynova_evals/leaderboard/arena_hard_leaderboard_20250123.csv

Cached answers + judge ratings for regenerating the leaderboard can be found here:
https://github.com/erictang000/arena-hard-auto/tree/skynova_evals/data/arena-hard-v0.1

@SumanthRH (Collaborator, Author) commented on Jan 18, 2025:

@DachengLi1 We've actually computed results for QwQ and Qwen2.5-32B-Instruct! They're in the results table in the README. For reproduction with lm_eval, I only gave the command for Sky-T1; for the other models, it's just a matter of substituting the model ID (see the sketch below). Hope that makes sense!
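
For concreteness, a hedged sketch of what that substitution could look like via lm_eval's Python API (the README documents the exact CLI command; this loop is just an illustration and assumes the vllm/hf backends and the mmlu task name available in lm-evaluation-harness):

```python
# Hypothetical illustration: run the same lm_eval task for each model by
# swapping only the model ID. Not the PR's exact invocation.
import lm_eval

model_ids = [
    "NovaSky-AI/Sky-T1-32B-Preview",
    "Qwen/Qwen2.5-32B-Instruct",
    "Qwen/QwQ-32B-Preview",
]

for model_id in model_ids:
    results = lm_eval.simple_evaluate(
        model="vllm",  # or "hf" if vLLM is not installed
        model_args=f"pretrained={model_id},dtype=auto",
        tasks=["mmlu"],
        num_fewshot=0,
    )
    print(model_id, results["results"]["mmlu"])
```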

@caoshiyi (Member) left a comment:

The results look great! Thanks for the efforts @erictang000 @SumanthRH! Also, I'm wondering how much effort would be needed if we also want a unified script to run all of these evaluations, since currently users still need to set things up differently for different benchmarks (e.g., lm_eval and BFCL).

@lynnliu030 (Member) left a comment:

Another thing as I'm trying out this eval suite: IFEval doesn't seem to print any final accuracy output when run according to the instructions? The other tasks on lm_eval work OK. Not sure if @SumanthRH @erictang000 have encountered this problem before.

@lynnliu030 (Member) commented:

Also, another thing I've noticed: we need to provide the same system instruction for our model as in the reasoning-task evaluation (https://github.com/NovaSky-AI/SkyThought/blob/main/skythought/tools/util/model_utils.py#L32), so I believe an additional parameter for setting this needs to be passed to lm_eval (a sketch of what that could look like is below).

I ran this again and got similar scores though, only slightly lower on MMLU. Maybe you can rerun and verify it again? @SumanthRH @erictang000 Thanks for the great help!
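
A hedged sketch of what passing that system prompt could look like, assuming lm_eval exposes `system_instruction` and `apply_chat_template` options (this is not the command used in the PR, and SYSTEM_PROMPT is a stand-in for the prompt defined in model_utils.py):

```python
# Hypothetical illustration, not the PR's actual setup: supply the same system
# instruction used in the reasoning-task evaluation when running lm_eval.
import lm_eval

SYSTEM_PROMPT = "..."  # copy from skythought/tools/util/model_utils.py

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args="pretrained=NovaSky-AI/Sky-T1-32B-Preview,tensor_parallel_size=8",
    tasks=["mmlu"],
    apply_chat_template=True,
    system_instruction=SYSTEM_PROMPT,
)
print(results["results"]["mmlu"])
```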

@SumanthRH (Collaborator, Author) commented on Jan 21, 2025:

> Another thing as I'm trying out this eval suite: IFEval doesn't seem to print any final accuracy output when run according to the instructions? The other tasks on lm_eval work OK. Not sure if @SumanthRH @erictang000 have encountered this problem before.

Thanks for pointing this out @lynnliu030! I missed an instruction for lm_eval: we need to install the dependencies for the IFEval task with the ifeval extra:

pip install -e .[ifeval]

I updated the installation instructions.

Also, I noticed that I had used tensor_parallel_size=4 for IFEval while the rest use 8. To be consistent, I changed this value to 8 and re-ran the IFEval evaluation. I noticed there is some variation (±1%) in the scores depending on the tensor_parallel_size used; I've added a disclaimer about this in base_instruct_evals.md.

@lynnliu030 (Member) commented:

@SumanthRH sounds good to me! Btw, I think the references to the instruction files in the main README.md also need to change.

@SumanthRH (Collaborator, Author) commented:

@lynnliu030 Done!

@SumanthRH (Collaborator, Author) commented on Jan 22, 2025:

I revisited the QwQ evaluation and found that the current results for QwQ are incorrect because of its chat template.

TLDR: QwQ's chat template defaults to a CoT-based system prompt when none is provided:

You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.

You can see the same here: https://huggingface.co/Qwen/QwQ-32B-Preview/blob/91906fe41a48b6a89ce2970abfd1269eefee170e/tokenizer_config.json#L197

What this means is that for tasks where a system message was not provided, the above CoT system prompt was added to all the samples. This can significantly change the results depending on the task.

1. lm_eval results: Some tasks are affected. For example, IFEval does not use a system prompt, but MMLU uses the dataset description as the system prompt.
2. LLM-as-a-Judge: Affected. No system prompt is used by default.
3. BFCL-v3: Not affected. All the models above (Sky-T1, Qwen2.5-32B-Instruct, QwQ) use the same QwenHandler data-processing class, which hardcodes Qwen2.5-32B-Instruct's chat template.

For the sake of consistency, I think it's best if we just use a different revision of QwQ for our evaluation (otherwise, we'd need special system-prompt handling per task). I've made a revision here (refs/pr/58) where the default prompt changes to

You are a helpful and harmless assistant. You are Qwen developed by Alibaba.

I'm going to re-evaluate QwQ now based on this. On IFEval, for example, the score improved from 34.75 to 42.51 after removing the CoT text. (The MMLU score was not affected, as expected.) See the sketch below for how the two revisions differ.
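
To make the difference concrete, here's a minimal sketch (assuming the transformers library; not part of the eval code) of how the default system prompt shows up in the two revisions:

```python
# Sketch: render the same user message with QwQ's original chat template and
# with the refs/pr/58 revision mentioned above, to see the default system
# prompt that gets injected when no system message is supplied.
from transformers import AutoTokenizer

messages = [{"role": "user", "content": "What is 2 + 2?"}]

tok_main = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview")
print(tok_main.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# Default system prompt includes "... You should think step-by-step."

tok_pr = AutoTokenizer.from_pretrained("Qwen/QwQ-32B-Preview", revision="refs/pr/58")
print(tok_pr.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
# Default system prompt drops the step-by-step instruction.
```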

@lynnliu030 (Member) commented:

@SumanthRH are all the re-evals done & updated? We can merge this PR for now if that's done. Thanks!

@SumanthRH (Collaborator, Author) commented:

@lynnliu030 updating Arena hard right now, and then we should be good.

@SumanthRH requested a review from lynnliu030 January 23, 2025 19:50
@erictang000 (Collaborator) commented:

@lynnliu030 Updated arena-hard - should be all set with the re-evals!

@lynnliu030 (Member) left a comment:

@erictang000 @SumanthRH I think this can be merged now, thanks for the work!

@lynnliu030 merged commit 7e02a76 into NovaSky-AI:main Jan 24, 2025
StephenXie pushed a commit to StephenXie/SkyThought that referenced this pull request Jan 30, 2025