I want to use GRPO to train a LoRA model after SFT, but the reward does not look normal. Is the model loaded correctly this way?
grpo.py

```python
#############################
# Initialize the GRPO trainer
#############################
if script_args.load_in_lora:
    base_model = AutoModelForCausalLM.from_pretrained(
        model_args.model_name_or_path,
        torch_dtype=torch_dtype,
        trust_remote_code=model_args.trust_remote_code,
        attn_implementation=model_args.attn_implementation,
        use_cache=False if training_args.gradient_checkpointing else True,
    )
    model = PeftModel.from_pretrained(base_model, model_id=script_args.lora_dir, is_trainable=True)
    model_in = model
else:
    model_kwargs = dict(
        revision=model_args.model_revision,
        trust_remote_code=model_args.trust_remote_code,
        attn_implementation=model_args.attn_implementation,
        torch_dtype=torch_dtype,
        use_cache=False if training_args.gradient_checkpointing else True,
    )
    training_args.model_init_kwargs = model_kwargs
    model_in = model_args.model_name_or_path

trainer = GRPOTrainer(
    model=model_in,
    reward_funcs=reward_funcs,
    args=training_args,
    train_dataset=dataset[script_args.dataset_train_split],
    eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
    peft_config=get_peft_config(model_args),
    callbacks=get_callbacks(training_args, model_args),
    processing_class=tokenizer,
)
```
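For reference, below is a minimal sketch (not from my script above) of an alternative way to carry the SFT LoRA weights into GRPO: merge the SFT adapter into the base model first, then let `GRPOTrainer` attach a fresh adapter via `peft_config`. The model paths, the dummy reward function, and `train_dataset` are placeholders, not the actual values used here.

```python
# Sketch only (assumptions noted): merge the SFT LoRA adapter into the base
# model, then let GRPOTrainer create a new adapter for the RL phase.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel, LoraConfig
from trl import GRPOTrainer, GRPOConfig

base_model = AutoModelForCausalLM.from_pretrained("BASE_MODEL_PATH", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained("BASE_MODEL_PATH")

# Load the SFT adapter and fold its weights into the base model.
sft_model = PeftModel.from_pretrained(base_model, "SFT_LORA_DIR")
merged_model = sft_model.merge_and_unload()

def dummy_reward(completions, **kwargs):
    # Placeholder reward for illustration: longer completions score higher.
    return [float(len(c)) for c in completions]

trainer = GRPOTrainer(
    model=merged_model,
    reward_funcs=dummy_reward,
    args=GRPOConfig(output_dir="grpo-out"),
    train_dataset=train_dataset,  # assumed to be defined elsewhere, with a "prompt" column
    peft_config=LoraConfig(task_type="CAUSAL_LM"),  # fresh adapter trained during GRPO
    processing_class=tokenizer,
)
trainer.train()
```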