
Loss oscillates around 0 in multi-node, multi-GPU Swift GRPO colocate training #3780

Open
PancakeAwesome opened this issue Apr 7, 2025 · 3 comments

Comments

PancakeAwesome commented Apr 7, 2025

Describe the bug
What the bug is, and how to reproduce, preferably with screenshots.
[Screenshots of the training logs attached]

On a 2-node, 16x A100 setup, when training swift GRPO in colocate mode, all completion results within each node are identical, the reward is always 1, and the KL and loss are always 0.

Your hardware and system info
Write your system info like CUDA version/system/GPU/torch version here.
CUDA 12.4, torch 2.4, Python 3.10, vLLM 0.7.3, swift 3.3.0.dev0, trl 0.16.0.dev0

NNODES=${WORLD_SIZE:-1}
NODE_RANK=${RANK:-0}
MASTER_ADDR=${MASTER_ADDR:-127.0.0.1}
MASTER_PORT=${MASTER_PORT:-$RANDOM_PORT}
NPROC_PER_NODE=8

swift rlhf \
    --rlhf_type grpo \
    --model DeepSeek-R1-Distill-Qwen-32B/ \
    --train_type full \
    --dataset train.jsonl \
    --torch_dtype bfloat16 \
    --num_train_epochs 999 \
    --max_length 2048 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 5e-7 \
    --save_total_limit 2 \
    --logging_steps 1 \
    --eval_steps 20 \
    --save_steps 20 \
    --output_dir /deepseek_distill_qwen_32b_grpo_reward_w_vllm_k8s \
    --gradient_accumulation_steps 2 \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --max_completion_length 2048 \
    --reward_funcs accuracy format \
    --num_generations 8 \
    --use_vllm true \
    --vllm_gpu_memory_utilization 0.3 \
    --sleep_level 1 \
    --deepspeed zero3_offload \
    --num_infer_workers 8 \
    --tensor_parallel_size 8 \
    --temperature 1.0 \
    --beta 0.001 \
    --max_grad_norm 1.0 \
    --temperature 0.6 \
    --top_p 0.9 \
    --top_k 50 \
    --repetition_penalty 1.03 \
    --move_model_batches 6 \
    --offload_optimizer true \
    --offload_model true \
    --async_generate false \
    --gc_collect_after_offload true \
    --model_type deepseek_r1_distill \
    --log_completions true \
    --report_to tensorboard

Additional context
Add any other context about the problem here.
The completion results differ between nodes.

PancakeAwesome (Author) commented:

@tastelikefeet

hjh0119 (Collaborator) commented Apr 7, 2025

Duplicate of #3745?

PancakeAwesome (Author) commented:

> Duplicate of #3745?

It's not the same issue. Essentially, in a multi-node, multi-GPU environment during the GRPO training stage, the model's output completions look normal, but the overall loss keeps oscillating around 0, and both the KL and the grad norm are constantly 0.
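For what it's worth, this symptom matches how GRPO normalizes rewards within each group of num_generations completions: if every completion in a group receives the same reward (e.g. the accuracy reward is always 1), all advantages in that group are zero and the policy term of the loss vanishes, and with beta as small as 0.001 the remaining KL term is tiny, so loss and grad norm sit near 0. A minimal sketch of that group normalization (not the actual ms-swift/trl code, just the GRPO formula):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Group-relative advantages: (r - mean) / (std + eps), per prompt group.

    rewards: tensor of shape (num_prompts, num_generations)
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# If every completion in a group gets the same reward (e.g. accuracy reward == 1),
# the group std is 0 and every advantage is 0, so the policy-gradient part of the
# GRPO loss contributes nothing -> loss and grad norm hover around 0.
rewards = torch.ones(2, 8)        # 2 prompts x num_generations=8, reward always 1
print(grpo_advantages(rewards))   # prints all zeros
```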

@PancakeAwesome changed the title from "Multi-node multi-GPU swift colocate GRPO training: loss oscillates around 0" to "Loss oscillates around 0 in multi-node, multi-GPU Swift GRPO colocate training" Apr 8, 2025