[WIP] Agent Training with Remote Service + Gym-like Protocol #973

Draft · wants to merge 21 commits into main

Conversation


@HMJiangGatech commented Apr 8, 2025

Implement logic to support agent training:

The implementation is based on the environment provided in https://github.com/HMJiangGatech/verl_agent_env_examples

The environment supports:

  • initialize environment: returns the initial observation
  • step environment: takes an action and returns the next observation and reward

Communication with the environment is through chat message lists in OpenAI format (a client-side sketch follows).
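
Below is a minimal, hypothetical sketch of the client side of this gym-like remote protocol, assuming plain HTTP POST endpoints. The paths /initialize and /step, the payload fields, and the env_id handling are illustrative assumptions, not the actual API of verl_agent_env_examples.

import requests

# Matches env.environment_endpoint in the experimental script below.
ENDPOINT = "http://localhost:8000"

def initialize(env_name: str) -> dict:
    """Create an environment instance and return the initial observation
    as an OpenAI-format chat message list (hypothetical endpoint name)."""
    resp = requests.post(f"{ENDPOINT}/initialize", json={"env_name": env_name})
    resp.raise_for_status()
    # Assumed response shape:
    # {"env_id": "...", "observation": [{"role": "user", "content": "..."}]}
    return resp.json()

def step(env_id: str, action: str) -> dict:
    """Send the agent's action as an assistant message; receive the next
    observation, the reward, and a done flag (hypothetical endpoint name)."""
    payload = {"env_id": env_id, "action": {"role": "assistant", "content": action}}
    resp = requests.post(f"{ENDPOINT}/step", json=payload)
    resp.raise_for_status()
    # Assumed response shape: {"observation": [...], "reward": 0.0, "done": false}
    return resp.json()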


The components needed in verl are:

  1. a dataset designed for agent training
  2. a ray trainer that supports multi-turn rollout (see the sketch after this list)
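
To illustrate the second item, here is a hedged sketch of the multi-turn rollout loop such a trainer has to drive, built on the hypothetical initialize/step client sketched above. The generate callable stands in for whatever inference engine produces the assistant reply (vLLM in the experimental script), and max_turn mirrors env.max_turn.

def rollout_episode(env_name: str, generate, max_turn: int = 10):
    """Roll out one episode and return the full chat history plus the total
    reward. A sketch only: the real trainer also has to track per-turn token
    spans and rewards so PPO can assign credit correctly."""
    state = initialize(env_name)
    env_id = state["env_id"]
    messages = list(state["observation"])       # running chat history, OpenAI format
    total_reward = 0.0
    for _ in range(max_turn):
        action = generate(messages)             # assistant reply from the policy model
        messages.append({"role": "assistant", "content": action})
        result = step(env_id, action)
        total_reward += result["reward"]
        if result["done"]:
            break
        messages.extend(result["observation"])  # new env/user messages extend the history
    return messages, total_reward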

Experimental Script:

  1. Generate the dataset with https://github.com/HMJiangGatech/verl_agent_env_examples/blob/master/examples/verl/sokoban/curate_data.py
  2. Run the training script:
python3 -m verl.trainer.main_agent_ppo \
    algorithm.adv_estimator=gae \
    env.environment_endpoint=http://localhost:8000 \
    env.max_turn=10 \
    data.agent_prompt_style=qwen2_5 \
    data.train_files=$HOME/code/verl_agent_env_examples/examples/verl/sokoban/data/train.parquet \
    data.val_files=$HOME/code/verl_agent_env_examples/examples/verl/sokoban/data/test.parquet \
    data.train_batch_size=256 \
    data.max_prompt_length=15360 \
    data.max_response_length=128 \
    data.truncation='error' \
    actor_rollout_ref.model.path=$HOME/code/models/qwen2_5-7b-instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
    actor_rollout_ref.rollout.max_num_batched_tokens=16384 \
    critic.optim.lr=1e-5 \
    critic.model.use_remove_padding=True \
    critic.model.path=$HOME/code/models/qwen2_5-7b-instruct \
    critic.model.enable_gradient_checkpointing=True \
    critic.ppo_micro_batch_size_per_gpu=32 \
    critic.model.fsdp_config.param_offload=False \
    critic.model.fsdp_config.optimizer_offload=False \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=['console','mlflow'] \
    trainer.project_name='verl_agent_env_examples' \
    trainer.experiment_name='sokoban_qwen2_5-7b' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=-1 \
    trainer.test_freq=1 \
    trainer.total_epochs=15 $@


CLAassistant commented Apr 8, 2025

CLA assistant check
All committers have signed the CLA.

@sunjin-k (Contributor) commented Apr 11, 2025

Hello, awesome work.
I was wondering how much this PR's functionality overlaps with PR #917 and issue #385?

@HMJiangGatech (Author) replied:

@sunjin-k
This implementation is more abstracted and lightweight than #917, and it's agnostic to whichever inference engine is used.
However, #917 involves a much deeper modification at the sglang worker level, which has more potential to improve agent-flow efficiency. My guess is that #917 may reuse the KV cache and skip a significant part of the prefill.

The Amazon team will figure out the right way to contribute this back to main.

@ChrisRBXiong

Hello, has there been any further progress on this work?

@ccclyu (Collaborator) commented May 31, 2025

@ChrisRBXiong I am rebasing this PR into the recipe folder and refactoring the code to test the functionality. I will post the latest progress once I am done.
