New Version of PPOTrainer

Hi, I’m trying to set up a PPO training loop for the Gemma-2b model; I have already trained my reward model.

In the new PPOTrainer class, policy, ref_policy, and reward_model are typed as nn.Module, unlike in the previous version of PPOTrainer.
Now, when I pass the Hugging Face model as policy and ref_policy, and my trained reward model as reward_model, I get the following error -

What should I pass for policy, ref_policy, and reward_model?


This is the only example I could find…

but…I can’t understand how to pass a Hugging Face pretrained (wrapper) model as a module for policy, ref_policy, and reward_model. Could you please help me?


I’m not sure either, but maybe something like this?

from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

from trl.trainer.rloo_trainer import RLOOConfig, RLOOTrainer
from trl.trainer.utils import SIMPLE_QUERY_CHAT_TEMPLATE


base_model_name = "EleutherAI/pythia-1b-deduped"

# Tokenizer: left padding for generation, plus a pad token (Pythia has none by default).
tokenizer = AutoTokenizer.from_pretrained(base_model_name, padding_side="left")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
if tokenizer.chat_template is None:
    tokenizer.chat_template = SIMPLE_QUERY_CHAT_TEMPLATE

# All three are plain pretrained modules, not TRL wrapper classes:
# the reward model is a sequence classifier with a single scalar output,
# the reference policy is a frozen copy of the base model used for the KL penalty,
# and the policy is the model that actually gets trained.
reward_model = AutoModelForSequenceClassification.from_pretrained(base_model_name, num_labels=1)
ref_policy = AutoModelForCausalLM.from_pretrained(base_model_name)
policy = AutoModelForCausalLM.from_pretrained(base_model_name)

train_dataset = ...  # make sure it has an "input_ids" column
eval_dataset = ...

trainer = RLOOTrainer(
    config=RLOOConfig(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=64,
        total_episodes=30000,
    ),
    tokenizer=tokenizer,
    policy=policy,
    ref_policy=ref_policy,
    reward_model=reward_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
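
By the way, for train_dataset, one way to get the required "input_ids" column is to tokenize a prompt dataset with datasets.map. This is just a rough sketch on top of the tokenizer above; the toy prompts and the tokenize helper are made up for illustration:

from datasets import Dataset

# Toy prompt data; replace with your own dataset.
raw = Dataset.from_dict({"prompt": ["What is RLHF?", "Explain PPO in one sentence."]})

def tokenize(example):
    # Render the prompt with the chat template, then tokenize to get "input_ids".
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": example["prompt"]}],
        tokenize=False,
    )
    return {"input_ids": tokenizer(text, truncation=True).input_ids}

train_dataset = raw.map(tokenize, remove_columns=raw.column_names)
eval_dataset = train_dataset  # placeholder; use a real held-out split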

Is it like DPOTrainer? I mean, we don’t need to write the PPO training loop ourselves here, the same way we don’t with DPO?


You could probably write your own loop, but unlike DPOTrainer, PPOTrainer’s config even has a default value for the reward model path, so it seems to work without specifying any special options.
However, there is so little documentation that it’s really hard to figure out…

reward_model_path (str, optional, defaults to "EleutherAI/pythia-160m") — Path to the reward model.
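
If it helps, putting the pieces together for PPO: based on the parameter names above (policy, ref_policy, reward_model), I’d expect the setup to look very much like the RLOO example, just with PPOConfig/PPOTrainer and your own reward model passed in explicitly. This is only a sketch and not verified against a specific TRL release; the exact keyword names and whether a separate value_model is required differ between versions, and the Gemma and reward-model paths are placeholders you’d replace with your own:

from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

from trl import PPOConfig, PPOTrainer

policy_name = "google/gemma-2b"            # or whichever Gemma-2b checkpoint you use
reward_name = "path/to/your/reward/model"  # placeholder for your trained reward model

tokenizer = AutoTokenizer.from_pretrained(policy_name, padding_side="left")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Plain pretrained modules, no TRL wrapper classes.
policy = AutoModelForCausalLM.from_pretrained(policy_name)
ref_policy = AutoModelForCausalLM.from_pretrained(policy_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_name, num_labels=1)
value_model = AutoModelForSequenceClassification.from_pretrained(reward_name, num_labels=1)

train_dataset = ...  # needs an "input_ids" column, as in the RLOO example
eval_dataset = ...

trainer = PPOTrainer(
    config=PPOConfig(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=64,
        total_episodes=30000,
    ),
    tokenizer=tokenizer,
    policy=policy,
    ref_policy=ref_policy,
    reward_model=reward_model,
    value_model=value_model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()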