If you look at the main documentation for DPO, or at the alignment handbook examples, it appears the prompt is separated out from "chosen" and "rejected" in the dataset before training. For example, from the alignment handbook:
def process(row):
    # *** Note: this separates out the prompt messages from chosen/rejected ***
    prompt_messages = row["chosen"][:-1]
    chosen_messages = row["chosen"][-1:]
    rejected_messages = row["rejected"][-1:]
    row["prompt"] = tokenizer.apply_chat_template(prompt_messages, tokenize=False)
    row["chosen"] = tokenizer.apply_chat_template(chosen_messages, tokenize=False)
    row["rejected"] = tokenizer.apply_chat_template(rejected_messages, tokenize=False)
    return row

ds = ds.map(
    process,
    num_proc=multiprocessing.cpu_count(),
    load_from_cache_file=False,
)
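For illustration, here is roughly what that handbook-style split produces on a hypothetical single-turn example. This is just a sketch: the model name, the messages, and their contents are made up, not taken from any real dataset.

import multiprocessing

from transformers import AutoTokenizer

# Example model with a chat template; any chat-template-capable tokenizer works.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# Hypothetical preference row (single user turn, two candidate answers).
row = {
    "chosen": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 = 4."},
    ],
    "rejected": [
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "2 + 2 = 5."},
    ],
}

# Handbook-style split: "prompt" holds everything up to the final turn,
# while "chosen"/"rejected" hold only the final assistant turn.
prompt = tokenizer.apply_chat_template(row["chosen"][:-1], tokenize=False)
chosen = tokenizer.apply_chat_template(row["chosen"][-1:], tokenize=False)
rejected = tokenizer.apply_chat_template(row["rejected"][-1:], tokenize=False)

print(prompt)    # only the user turn, rendered with the chat template
print(chosen)    # only the preferred assistant turn
print(rejected)  # only the dispreferred assistant turn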
However, in the official TRL examples, the prompt is not separated out:
def process(row):
    # *** Prompt was not separated out ***
    row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
    row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
    return row

ds = ds.map(
    process,
    num_proc=multiprocessing.cpu_count(),
    load_from_cache_file=False,
)
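For comparison, applying this TRL-example-style mapping to the same hypothetical row (continuing the sketch above, reusing the same tokenizer and row) renders the whole conversation into each column and produces no "prompt" column at all:

# TRL-example-style processing: the full conversation is templated per column.
chosen_full = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
rejected_full = tokenizer.apply_chat_template(row["rejected"], tokenize=False)

# Each string now contains the user turn *and* the assistant turn,
# and there is no separate "prompt" column in the resulting dataset.
print(chosen_full)
print(rejected_full)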
These two methods are going to send different inputs to the DPO trainer. Why the discrepancy? Is it because it does not matter? Or is there a bug in one of them, such as in the TRL example code?
Thanks!