
DeepSeek model: low accuracy result on the MMLU dataset #13230

Open
@shawn9977

Description


Image: intelanalytics/ipex-llm-serving-xpu:0.8.3-b19 or intelanalytics/ipex-llm-serving-xpu:0.8.3-b21
Model: DeepSeek-R1-Distill-Qwen-32B, SYM_INT4 quantization
Tool: Lighteval
Dataset: MMLU

The benchmarked accuracy is abnormally low at only 27.67%. For reference, the same DeepSeek-R1-Distill-Qwen-32B INT4 model benchmarks at 78.82% accuracy on an NVIDIA A100.
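For context, MMLU questions are four-way multiple choice, so a model answering uniformly at random scores about 25%. A quick arithmetic check on the two reported numbers shows the XPU result sits only ~2.7 points above that chance floor, while trailing the A100 reference by ~51 points, which suggests the quantized model's outputs are close to random guessing rather than mildly degraded:

```python
# Compare the reported MMLU accuracies against the random-guessing baseline.
# MMLU questions have 4 answer choices, so chance-level accuracy is 1/4.
chance = 1 / 4
reported_xpu = 0.2767    # SYM_INT4 on XPU (this issue's run)
reported_a100 = 0.7882   # INT4 on NVIDIA A100 (reference value above)

gap_to_chance = reported_xpu - chance          # ~0.0267
gap_to_reference = reported_a100 - reported_xpu  # ~0.5115

print(f"above chance by {gap_to_chance:.4f}")
print(f"below A100 reference by {gap_to_reference:.4f}")
```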

(WrapperWithLoadBit pid=10769) 2025:06:13-12:30:17:(10769) |CCL_WARN| device_family is unknown, topology discovery could be incorrect, it might result in suboptimal performance [repeated 2x across cluster]
(WrapperWithLoadBit pid=10769) 2025:06:13-12:30:17:(10769) |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices [repeated 24x across cluster]
(WrapperWithLoadBit pid=10769) -----> current rank: 3, world size: 4, byte_count: 15360000,is_p2p:1 [repeated 2x across cluster]
(WrapperWithLoadBit pid=10769) WARNING 06-13 12:30:19 [_logger.py:68] Pin memory is not supported on XPU. [repeated 2x across cluster]
[2025-06-13 15:38:57,787] [ INFO]: --- COMPUTING METRICS --- (pipeline.py:498)
[2025-06-13 15:38:58,608] [ INFO]: --- DISPLAYING RESULTS --- (pipeline.py:540)

Task Version Metric Value Stderr
all acc 0.2767 ± 0.0332
original:mmlu:_average:0 acc 0.2767 ± 0.0332
original:mmlu:abstract_algebra:0 0 acc 0.2200 ± 0.0416
original:mmlu:anatomy:0 0 acc 0.2370 ± 0.0367
original:mmlu:astronomy:0 0 acc 0.2500 ± 0.0352
original:mmlu:business_ethics:0 0 acc 0.3800 ± 0.0488
original:mmlu:clinical_knowledge:0 0 acc 0.2340 ± 0.0261
original:mmlu:college_biology:0 0 acc 0.3125 ± 0.0388
original:mmlu:college_chemistry:0 0 acc 0.2000 ± 0.0402
original:mmlu:college_computer_science:0 0 acc 0.2700 ± 0.0446
original:mmlu:college_mathematics:0 0 acc 0.2100 ± 0.0409
original:mmlu:college_medicine:0 0 acc 0.2254 ± 0.0319
original:mmlu:college_physics:0 0 acc 0.2157 ± 0.0409
original:mmlu:computer_security:0 0 acc 0.3300 ± 0.0473
original:mmlu:conceptual_physics:0 0 acc 0.3064 ± 0.0301
original:mmlu:econometrics:0 0 acc 0.2368 ± 0.0400
original:mmlu:electrical_engineering:0 0 acc 0.2759 ± 0.0372
original:mmlu:elementary_mathematics:0 0 acc 0.2249 ± 0.0215
original:mmlu:formal_logic:0 0 acc 0.2778 ± 0.0401
original:mmlu:global_facts:0 0 acc 0.2100 ± 0.0409
original:mmlu:high_school_biology:0 0 acc 0.2226 ± 0.0237
original:mmlu:high_school_chemistry:0 0 acc 0.1823 ± 0.0272
original:mmlu:high_school_computer_science:0 0 acc 0.2900 ± 0.0456
original:mmlu:high_school_european_history:0 0 acc 0.3212 ± 0.0365
original:mmlu:high_school_geography:0 0 acc 0.3030 ± 0.0327
original:mmlu:high_school_government_and_politics:0 0 acc 0.2176 ± 0.0298
original:mmlu:high_school_macroeconomics:0 0 acc 0.2538 ± 0.0221
original:mmlu:high_school_mathematics:0 0 acc 0.2111 ± 0.0249
original:mmlu:high_school_microeconomics:0 0 acc 0.2563 ± 0.0284
original:mmlu:high_school_physics:0 0 acc 0.1987 ± 0.0326
original:mmlu:high_school_psychology:0 0 acc 0.3523 ± 0.0205
original:mmlu:high_school_statistics:0 0 acc 0.1620 ± 0.0251
original:mmlu:high_school_us_history:0 0 acc 0.2990 ± 0.0321
original:mmlu:high_school_world_history:0 0 acc 0.3882 ± 0.0317
original:mmlu:human_aging:0 0 acc 0.3453 ± 0.0319
original:mmlu:human_sexuality:0 0 acc 0.3359 ± 0.0414
original:mmlu:international_law:0 0 acc 0.2893 ± 0.0414
original:mmlu:jurisprudence:0 0 acc 0.2963 ± 0.0441
original:mmlu:logical_fallacies:0 0 acc 0.3313 ± 0.0370
original:mmlu:machine_learning:0 0 acc 0.3214 ± 0.0443
original:mmlu:management:0 0 acc 0.2718 ± 0.0441
original:mmlu:marketing:0 0 acc 0.4316 ± 0.0324
original:mmlu:medical_genetics:0 0 acc 0.3000 ± 0.0461
original:mmlu:miscellaneous:0 0 acc 0.3614 ± 0.0172
original:mmlu:moral_disputes:0 0 acc 0.2919 ± 0.0245
original:mmlu:moral_scenarios:0 0 acc 0.2402 ± 0.0143
original:mmlu:nutrition:0 0 acc 0.2516 ± 0.0248
original:mmlu:philosophy:0 0 acc 0.2379 ± 0.0242
original:mmlu:prehistory:0 0 acc 0.2809 ± 0.0250
original:mmlu:professional_accounting:0 0 acc 0.2411 ± 0.0255
original:mmlu:professional_law:0 0 acc 0.2477 ± 0.0110
original:mmlu:professional_medicine:0 0 acc 0.1875 ± 0.0237
original:mmlu:professional_psychology:0 0 acc 0.3105 ± 0.0187
original:mmlu:public_relations:0 0 acc 0.2818 ± 0.0431
original:mmlu:security_studies:0 0 acc 0.2939 ± 0.0292
original:mmlu:sociology:0 0 acc 0.2985 ± 0.0324
original:mmlu:us_foreign_policy:0 0 acc 0.3200 ± 0.0469
original:mmlu:virology:0 0 acc 0.2892 ± 0.0353
original:mmlu:world_religions:0 0 acc 0.4386 ± 0.0381
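As a cross-check that the headline number is not a reporting artifact, the 0.2767 `_average` row is consistent with the unweighted mean of the 57 per-subtask accuracies listed above (this assumes `_average` is a simple macro-average, which is what the recomputation confirms):

```python
# Per-subtask MMLU accuracies copied from the results table above (57 subjects).
accs = [
    0.2200, 0.2370, 0.2500, 0.3800, 0.2340, 0.3125, 0.2000, 0.2700, 0.2100, 0.2254,
    0.2157, 0.3300, 0.3064, 0.2368, 0.2759, 0.2249, 0.2778, 0.2100, 0.2226, 0.1823,
    0.2900, 0.3212, 0.3030, 0.2176, 0.2538, 0.2111, 0.2563, 0.1987, 0.3523, 0.1620,
    0.2990, 0.3882, 0.3453, 0.3359, 0.2893, 0.2963, 0.3313, 0.3214, 0.2718, 0.4316,
    0.3000, 0.3614, 0.2919, 0.2402, 0.2516, 0.2379, 0.2809, 0.2411, 0.2477, 0.1875,
    0.3105, 0.2818, 0.2939, 0.2985, 0.3200, 0.2892, 0.4386,
]
assert len(accs) == 57  # MMLU has 57 subjects

macro_avg = sum(accs) / len(accs)
print(f"{macro_avg:.4f}")  # 0.2767, matching the _average row
```

No subtask rises much above the ~40% range, so the degradation is uniform across subjects rather than concentrated in a few tasks.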

[2025-06-13 15:38:58,686] [ INFO]: --- SAVING AND PUSHING RESULTS --- (pipeline.py:530)
[2025-06-13 15:38:58,686] [ INFO]: Saving experiment tracker (evaluation_tracker.py:196)
[2025-06-13 15:39:07,447] [ INFO]: Saving results to /llm/intelmc8/shawn/project/lighteval/results/results/_llm_intelmc8_models_DeepSeek-R1-Distill-Qwen-32B/results_2025-06-13T15-38-58.686645.json (evaluation_tracker.py:265)
