When running the official GSM8K with-tool, multi-turn async rollout SGLang example without any modifications, the model crashes during training and NaN appears. #1581

Open
supermancmk opened this issue May 19, 2025 · 15 comments

Comments

@supermancmk

I pulled the latest version of verl's code, and when running the official GSM8K with-tool, multi-turn async rollout SGLang example without any modifications, training crashes: after a fixed number of steps, grad_norm and KL loss skyrocket, the training and test rewards drop sharply to 0, and NaN appears during training.
Any solution would be greatly appreciated.
Here is my wandb log.

[wandb screenshots omitted]

@630bdd

630bdd commented May 19, 2025

I'm having the same problem.

@dawson-chen
Copy link

Same bug here when training a search agent with my custom scheduler on the async vLLM implementation; the async path keeps tripping me up.

[screenshot omitted]

@chenhaiq
Collaborator

Are you using this script: examples/sglang_multiturn/run_qwen2.5-3b_gsm8k_multiturn_4xgpu.sh?

@supermancmk
Author

supermancmk commented May 20, 2025

> Are you using this script: examples/sglang_multiturn/run_qwen2.5-3b_gsm8k_multiturn_4xgpu.sh?

I use this script: examples/sglang_multiturn/run_qwen2.5-3b_gsm8k_multiturn.sh. I have tried both 4 nodes and a single node (8 GPUs per node), and training crashes the same way in both cases.

@wuxibin89
Collaborator

@dawson-chen Can you try disabling vLLM's prefix caching?

diff --git a/verl/workers/rollout/vllm_rollout/vllm_async_server.py b/verl/workers/rollout/vllm_rollout/vllm_async_server.py
index 4f8109e..3d6f612 100644
--- a/verl/workers/rollout/vllm_rollout/vllm_async_server.py
+++ b/verl/workers/rollout/vllm_rollout/vllm_async_server.py
@@ -178,7 +178,7 @@ class AsyncvLLMServer(AsyncServerBase):
             disable_log_stats=config.disable_log_stats,
             max_num_batched_tokens=max_num_batched_tokens,
             enable_chunked_prefill=config.enable_chunked_prefill,
-            enable_prefix_caching=True,
+            enable_prefix_caching=False,
             trust_remote_code=trust_remote_code,
             seed=self.vllm_dp_rank,
         )
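
For reference, the same switch can also be exercised outside verl on a standalone vLLM async engine. The snippet below is a minimal sketch assuming a vLLM 0.8.x-style API; the model path is a placeholder, and inside verl these engine args are built by AsyncvLLMServer as in the diff above.

```python
# Minimal sketch (not verl's code): build a standalone async vLLM engine with
# prefix caching disabled, to compare rollout behavior with and without the cache.
from vllm import AsyncEngineArgs, AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="Qwen/Qwen2.5-3B-Instruct",  # placeholder; point at your local checkpoint
    enable_prefix_caching=False,       # same flag the diff above flips inside verl
)
engine = AsyncLLMEngine.from_engine_args(engine_args)
```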

@dawson-chen

> Can you try disabling vLLM's prefix caching?

Thanks @wuxibin89, I'll give it a try later. My current vLLM version is 0.8.3; should I switch to a newer version?

@SwordFaith
Collaborator

Could you help revert the format to chatml and rerun it? There might be some discrepancies between the shared WandB log and the current main settings. Your assistance would be greatly appreciated.

@supermancmk
Author

> Could you help revert the format to chatml and rerun it? There might be some discrepancies between the shared WandB log and the current main settings.

Sorry, I'm not quite sure how to do that. I used the chatml format for training.

@SwordFaith
Collaborator

SwordFaith commented May 22, 2025

After the reproduction effort by @zyzshishui, we noticed advantage/max hitting 0 with the current script on main, which may cause training instability. Training also seems more stable after raising train_batch_size and ppo_mini_batch_size from 256 to 512, which makes it less likely that an entire batch is solved perfectly and ends up with zero advantages (see the sketch after the wandb link below). Can you check whether that works for you?

advantage/max 0:

[screenshot omitted]

New run with batch size 512 and rollout.n=8 (wandb):
https://wandb.ai/zhaochenyang20/gsm8k_async_rl/runs/2biev775?nw=nwuserzhaochenyang20
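
To make the advantage/max = 0 observation concrete, here is a minimal sketch of group-normalized (GRPO-style) outcome advantages; the function name and shapes are illustrative rather than verl's actual API.

```python
# Minimal sketch (illustrative, not verl's exact implementation) of GRPO-style
# group-normalized advantages computed from per-rollout outcome rewards.
import torch

def group_normalized_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, n_rollouts) scalar reward for each rollout."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# If every rollout for a prompt gets the same reward (e.g. the whole group solves
# the GSM8K question, all rewards 1.0), std is 0 and that group's advantages are
# all 0, so it contributes no learning signal.
print(group_normalized_advantages(torch.ones(2, 8)))  # prints all zeros
```

A larger train/mini batch (or more rollouts per prompt) makes it less likely that an entire update is dominated by such zero-advantage groups, which is the stabilizing effect described above.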

@supermancmk
Author

supermancmk commented May 22, 2025

Thank you very much for your reply. I re-pulled the latest version of the verl code, set the train batch size and ppo_mini_batch_size to 512 and rollout.n to 16, and left everything else unchanged. However, at around step 100 the model started to collapse and the reward dropped from 90% to 10%, with NaN appearing. Below are my environment, training command, and wandb link and logs.

  1. Below is my install environment:
conda create -n verl python==3.10 -y
conda activate verl
cd /root/verl_0522
pip install torch torchvision
pip install flash-attn --no-build-isolation
pip install -e .[vllm]
pip install -e .[sglang]
pip install math_verify json5
pip install -U "ray[default]"
  2. Below is my command:
set -x

ulimit -n 65535

PROJECT_DIR="$(pwd)"
CONFIG_PATH="$PROJECT_DIR/examples/sglang_multiturn/config"
HOME_DIR=/root/verl_0522
python -m verl.trainer.main_ppo \
    --config-path="$CONFIG_PATH" \
    --config-name='gsm8k_multiturn_grpo' \
    algorithm.adv_estimator=grpo \
    data.train_batch_size=512 \
    data.max_prompt_length=1024 \
    data.max_response_length=1024 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    data.return_raw_chat=True \
    actor_rollout_ref.model.path=/root/Qwen2.5-3B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=512 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=32 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
    actor_rollout_ref.rollout.name=sglang_async \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    actor_rollout_ref.rollout.n=16 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=32 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name='gsm8k_async_rl_debug_verl' \
    trainer.experiment_name='qwen2.5-3b_function_rm-gsm8k-async-sgl-multi-w-tool-verify-n16-4nodes' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=4 \
    trainer.save_freq=-1 \
    trainer.test_freq=20 \
    data.train_files=$HOME_DIR/data/gsm8k_verl_sgl_multi_turn_preprocessed/train.parquet \
    data.val_files=$HOME_DIR/data/gsm8k_verl_sgl_multi_turn_preprocessed/test.parquet \
    actor_rollout_ref.rollout.multi_turn.tool_config_path="$PROJECT_DIR/examples/sglang_multiturn/config/tool_config/gsm8k_tool_config.yaml" \
    trainer.total_epochs=15 \
    trainer.val_before_train=True $@
  3. Below is my wandb log:

wandb link: https://wandb.ai/luohaipeng12/gsm8k_async_rl_debug_verl?nw=nwuserluohaipeng12

[wandb screenshots omitted]
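
As a side note on the kl_loss_type=low_var_kl setting in the command above: the reported spike in KL loss and grad_norm is consistent with the policy drifting far from the reference model. Below is a minimal sketch of a k3-style low-variance KL estimate; names are illustrative, and verl's actual loss code (including any clamping) is the source of truth.

```python
# Minimal sketch of a k3-style low-variance KL estimate, the kind of quantity
# selected by kl_loss_type=low_var_kl (illustrative; not copied from verl).
import torch

def low_var_kl(logprob: torch.Tensor, ref_logprob: torch.Tensor) -> torch.Tensor:
    # k3 estimator: exp(r) - r - 1 with r = log p_ref - log p_policy; always >= 0.
    log_ratio = ref_logprob - logprob
    return torch.exp(log_ratio) - log_ratio - 1.0

# Once rollouts degenerate (e.g. long repetitive outputs), the log ratio can grow
# large and exp(log_ratio) blows up, which is one way the KL loss and grad_norm can
# spike together right before the run hits NaN.
```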

@dawson-chen

Hi @wuxibin89, following your suggestions, I conducted 3 controlled experiments to investigate the training crashes.

Environment Setup

Experimental Results

I ran three test configurations:

  • Green line: Using vLLM prefix cache
  • Red line: Without vLLM prefix cache
  • Orange line: Without vLLM prefix cache + max response length extended from 10k to 20k tokens (continued training from step 120 of the Red line)
[comparison training curves omitted]

The results show that disabling vLLM's prefix cache does delay the crashes, but doesn't prevent them entirely. All crashes exhibit the same behavior pattern: the model suddenly begins generating repetitive output.

[screenshot omitted]

I've already added a repetition penalty to the vLLM request parameters:

extra_body={'repetition_penalty': 1.05}

However, this hasn't resolved the issue.
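
For context, a minimal sketch of how such a request-level parameter is typically forwarded to an OpenAI-compatible vLLM server with the openai Python client; the endpoint, model name, and prompt are placeholders, not taken from this setup.

```python
# Minimal sketch (assumes an OpenAI-compatible vLLM server on localhost:8000 and
# the openai>=1.0 Python client; endpoint and model name are placeholders).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",
    messages=[{"role": "user", "content": "Compute 12 * 7."}],
    extra_body={"repetition_penalty": 1.05},  # forwarded to vLLM's sampling params
)
print(resp.choices[0].message.content)
```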

@lebronjamesking

I think the verl sglang multi-turn tool calling is working, by the way: https://github.com/volcengine/verl/blob/54b2677/examples/sglang_multiturn/README.md

@supermancmk
Author

May I ask whether your training runs normally? How many steps did you train for, and can you share your training log? My runs usually collapse late in training, typically around steps 100 to 200, but they look normal in the early stage.
Thanks

@yuleiqin

yuleiqin commented May 29, 2025

Did you try 32 GPUs (4 nodes x 8 GPUs per node) for the 32B Qwen2.5 model? It always failed for me at the beginning, but for the 7B and 3B models everything went smoothly for at least 100 steps.

Sglang: v0.4.6-post5
verl: 0.3.1-dev
@supermancmk

@yuleiqin

https://api.wandb.ai/links/yuleiqin-tencent/tk23kwpp

This is my training curve @supermancmk
