[WIP] Agent Training with Remote Service + Gym-like Protocol #973

Draft · wants to merge 21 commits into main

Conversation


@HMJiangGatech commented Apr 8, 2025

Implement logic to support agent training:

The implementation is based on the environment provided in https://github.com/HMJiangGatech/verl_agent_env_examples

The environment supports:

  • initialize environment: returns the initial observation
  • step environment: takes an action and returns the next observation and reward

Communication with the environment is through chat message lists in OpenAI format (a client-side sketch follows).
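
Below is a minimal, hypothetical sketch of the client side of this gym-like remote protocol, assuming plain HTTP POST endpoints. The paths /initialize and /step, the payload fields, and the env_id handling are illustrative assumptions, not the actual API of verl_agent_env_examples.

import requests

# Matches env.environment_endpoint in the experimental script below.
ENDPOINT = "http://localhost:8000"

def initialize(env_name: str) -> dict:
    """Create an environment instance and return the initial observation
    as an OpenAI-format chat message list (hypothetical endpoint name)."""
    resp = requests.post(f"{ENDPOINT}/initialize", json={"env_name": env_name})
    resp.raise_for_status()
    # Assumed response shape:
    # {"env_id": "...", "observation": [{"role": "user", "content": "..."}]}
    return resp.json()

def step(env_id: str, action: str) -> dict:
    """Send the agent's action as an assistant message; receive the next
    observation, the reward, and a done flag (hypothetical endpoint name)."""
    payload = {"env_id": env_id, "action": {"role": "assistant", "content": action}}
    resp = requests.post(f"{ENDPOINT}/step", json=payload)
    resp.raise_for_status()
    # Assumed response shape: {"observation": [...], "reward": 0.0, "done": false}
    return resp.json()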


The components needed in verl are:

  1. a dataset designed for agent training
  2. a ray trainer that supports multi-turn rollout (see the sketch after this list)
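
To illustrate the second item, here is a hedged sketch of the multi-turn rollout loop such a trainer has to drive, built on the hypothetical initialize/step client sketched above. The generate callable stands in for whatever inference engine produces the assistant reply (vLLM in the experimental script), and max_turn mirrors env.max_turn.

def rollout_episode(env_name: str, generate, max_turn: int = 10):
    """Roll out one episode and return the full chat history plus the total
    reward. A sketch only: the real trainer also has to track per-turn token
    spans and rewards so PPO can assign credit correctly."""
    state = initialize(env_name)
    env_id = state["env_id"]
    messages = list(state["observation"])       # running chat history, OpenAI format
    total_reward = 0.0
    for _ in range(max_turn):
        action = generate(messages)             # assistant reply from the policy model
        messages.append({"role": "assistant", "content": action})
        result = step(env_id, action)
        total_reward += result["reward"]
        if result["done"]:
            break
        messages.extend(result["observation"])  # new env/user messages extend the history
    return messages, total_reward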

Experimental Script:

  1. Generate the dataset with https://github.com/HMJiangGatech/verl_agent_env_examples/blob/master/examples/verl/sokoban/curate_data.py
  2. Run the training script:
python3 -m verl.trainer.main_agent_ppo \
    algorithm.adv_estimator=gae \
    env.environment_endpoint=http://localhost:8000 \
    env.max_turn=10 \
    data.agent_prompt_style=qwen2_5 \
    data.train_files=$HOME/code/verl_agent_env_examples/examples/verl/sokoban/data/train.parquet \
    data.val_files=$HOME/code/verl_agent_env_examples/examples/verl/sokoban/data/test.parquet \
    data.train_batch_size=256 \
    data.max_prompt_length=15360 \
    data.max_response_length=128 \
    data.truncation='error' \
    actor_rollout_ref.model.path=$HOME/code/models/qwen2_5-7b-instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.actor.use_kl_loss=False \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=32 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
    actor_rollout_ref.rollout.max_num_batched_tokens=16384 \
    critic.optim.lr=1e-5 \
    critic.model.use_remove_padding=True \
    critic.model.path=$HOME/code/models/qwen2_5-7b-instruct \
    critic.model.enable_gradient_checkpointing=True \
    critic.ppo_micro_batch_size_per_gpu=32 \
    critic.model.fsdp_config.param_offload=False \
    critic.model.fsdp_config.optimizer_offload=False \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=['console','mlflow'] \
    trainer.project_name='verl_agent_env_examples' \
    trainer.experiment_name='sokoban_qwen2_5-7b' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=-1 \
    trainer.test_freq=1 \
    trainer.total_epochs=15 $@


CLAassistant commented Apr 8, 2025

CLA assistant check
All committers have signed the CLA.

@sunjin-k (Contributor) commented Apr 11, 2025

Hello, awesome work.
I was wondering how much this PR's functionality overlaps with PR #917 and issue #385?

@HMJiangGatech (Author) replied:

@sunjin-k
This implementation is more abstracted and lightweight than #917, and it's agnostic to whichever inference engine is used.
However, #917 involves a much deeper modification at the sglang worker level, which has more potential to improve agent-flow efficiency. My guess is that #917 may reuse the KV cache and skip a significant part of the prefill.

The Amazon team will figure out the right way to contribute this back to main.

@ChrisRBXiong

Hello, has there been any further progress on this work?

@ccclyu (Collaborator) commented May 31, 2025

@ChrisRBXiong I am rebasing this PR into the recipe folder and refactoring the code to test the functionality. I will post the latest progress once I am done.
