Replies: 2 comments 1 reply
-
When using LMEvaluator, the temperature is set to 0.6. When using MATHEvaluator as the judge, should the temperature instead be set to 0 (or 0.001) to reduce randomness in the generated verdicts?
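A minimal sketch of the idea behind the question: keep sampling (temperature 0.6) for the answer-generating model, but pin the judge to near-greedy decoding so verification is deterministic. The exact parameter names accepted by your evaluator wrapper are an assumption; they follow the generation-config style of OpenAI-compatible APIs.

```python
# Hedged sketch, not OpenCompass's official config: pin the judge model to
# (near-)greedy decoding so MATHEvaluator-style verification is repeatable.
judge_generation_config = dict(
    temperature=0.0,  # 0 (or a tiny value such as 0.001) removes sampling noise
    top_p=1.0,        # leave the distribution untouched; temperature already pins the argmax
    max_tokens=1024,  # judging a single answer rarely needs a long output
)

# The answer-generating model keeps the recommended sampling settings:
gen_generation_config = dict(temperature=0.6, top_p=0.95)
```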
0 replies
-
Please consider aime2024_llmverify_repeat8_gen_e8fcee as the reference. aime2024_gen_6e39a4 truncates the max output length to 2048 and will be deprecated in the future. Also, you may need to repeat the run 64 times for stable performance.
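The two overrides suggested above can be sketched as a small config fragment. The key names (`max_out_len`, `n`) follow common OpenCompass config conventions but are assumptions here; check them against the aime2024_llmverify_repeat8_gen_e8fcee config you copy from.

```python
# Hedged sketch of the suggested overrides for a stable AIME2024 run.
aime2024_override = dict(
    max_out_len=32768,  # R1-style long reasoning gets truncated at the default 2048 otherwise
    n=64,               # only 30 AIME problems, so many repeats are needed to tame pass@1 variance
)
```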
1 reply
-
A question: I am benchmarking AIME2024 accuracy with the Ollama framework, using a quantized DeepSeek-R1-Distill-Qwen-7B model. With aime2024_llmverify_repeat8_gen_e8fcee the accuracy is 62.08, but the officially reported pass@1 is only 55.5. With aime2024_gen_6e39a4 and max_out_len set to 32768, the accuracy is only 3.33.
Ollama-based test code:
eval_deepseek_r1_int4.txt
aime2024_gen_6e39a4 config:
aime2024_gen_6e39a4.txt
Could you please help check whether my Ollama configuration (mainly the models section) is correct?
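For reference, here is a hedged sketch of what a models entry for an OpenAI-compatible backend such as Ollama's `/v1` endpoint typically looks like. The `abbr` and `path` values are hypothetical placeholders, and the field names follow OpenCompass's OpenAI-style model wrapper; verify them against `opencompass.models` in your installed version rather than copying this verbatim.

```python
# Hedged sketch, assuming an OpenAI-style model wrapper pointed at Ollama.
models = [
    dict(
        abbr='deepseek-r1-distill-qwen-7b-int4',     # hypothetical label for results tables
        # type=OpenAISDK,                            # assumed wrapper class; check your version
        path='deepseek-r1-distill-qwen-7b',          # hypothetical Ollama model tag
        openai_api_base='http://localhost:11434/v1', # Ollama's OpenAI-compatible endpoint
        max_out_len=32768,  # must be large enough for R1-style long chains of thought
        temperature=0.6,    # DeepSeek's recommended sampling temperature
        batch_size=1,
    ),
]
```

A too-small `max_out_len` is the usual cause of near-zero accuracy on this benchmark: the reasoning trace gets truncated before the final boxed answer is emitted.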