
unsatisfactory result and strange reward #605


Open
qianfantianyuzhouzhou opened this issue Apr 15, 2025 · 1 comment


Image

Image

It seems that the reward in the table depends only on the format reward?

Also, my training result (MATH dataset, Qwen2.5-3B with GRPO) is only 37%. Is that lower than it should be?

Here are my parameters, the same as in the example:


bf16: true
use_vllm: true
vllm_device: auto
vllm_gpu_memory_utilization: 0.6
do_eval: true
eval_strategy: steps
eval_steps: 100
gradient_accumulation_steps: 8
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: Qwen-2.5-3B
learning_rate: 3.0e-06
lr_scheduler_type: cosine
max_prompt_length: 512
max_completion_length: 1024
max_steps: -1
num_generations: 6
num_train_epochs: 1
output_dir: data/Qwen-2.5-3B-Simple-RL
overwrite_output_dir: true
per_device_eval_batch_size: 2
per_device_train_batch_size: 2
push_to_hub: false
report_to:
- wandb
reward_funcs:
- accuracy
- format
reward_weights:
- 1.0
- 1.0
save_strategy: "no"
seed: 42
warmup_ratio: 0.1
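
For context on the reward question, here is a minimal sketch of how two reward functions with `reward_weights` of 1.0 each are commonly combined in GRPO, and how the combined reward is normalized within a group of `num_generations` completions. This is not the project's actual implementation: the helper names, the sample reward values, and the choice of population standard deviation are assumptions made for illustration. It shows that when the accuracy reward is 0 for every completion in a group, the total reward (and therefore the advantage) varies only with the format reward.

```python
from statistics import mean, pstdev


def combine_rewards(per_func_rewards, weights):
    """Weighted sum of the per-function rewards for one completion."""
    return sum(w * r for w, r in zip(weights, per_func_rewards))


def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of completions.

    Assumption: population standard deviation; a real trainer may use
    the sample standard deviation or a different epsilon.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (accuracy, format) rewards for one prompt, num_generations = 6.
# Accuracy is 0 everywhere, so only the format reward separates completions.
group = [(0.0, 1.0), (0.0, 1.0), (0.0, 0.0), (0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]
weights = [1.0, 1.0]  # mirrors reward_weights in the config

totals = [combine_rewards(r, weights) for r in group]
advantages = group_advantages(totals)

for (acc, fmt), total, adv in zip(group, totals, advantages):
    print(f"accuracy={acc} format={fmt} total={total} advantage={adv:+.3f}")
```

If every completion in a group receives the same total reward, all advantages are zero and that group contributes no learning signal, so it can help to check the per-function reward means logged to wandb separately rather than only the summed reward.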
