DPO training data format

In preparing a dataset for DPO (Direct Preference Optimization) training, should the “prompt” be repeated in the “chosen” and “rejected” columns?

I’ve come across some conflicting information regarding the proper formatting of the dataset for DPO training. Some sources suggest that the prompt should be included in both the “chosen” and “rejected” responses to provide full context, while others state that the prompt should be kept separate and not repeated in these columns.

Additionally, when working with multi-turn dialogue data, I’m unsure how to properly format the dataset. Should the “chosen” and “rejected” columns include the entire conversation history up to that point, or just the assistant’s most recent response following the latest user input?

Could someone clarify the correct approach for formatting the dataset? Should the “chosen” and “rejected” columns contain only the assistant’s responses following the prompt, or should they include the prompt as well? And how should I handle multi-turn dialogues in this context?
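For concreteness, these are the two layouts I keep seeing (the texts are invented; the column names follow what TRL's DPOTrainer expects):

```python
# Hypothetical single-turn example: the prompt lives in its own column,
# and "chosen"/"rejected" hold only the assistant's reply.
row_separate = {
    "prompt": "What is the capital of France?",
    "chosen": "The capital of France is Paris.",
    "rejected": "I think it might be Lyon.",
}

# The alternative I've seen in some notebooks: the prompt text is repeated
# inside "chosen" and "rejected" as part of the full text.
row_repeated = {
    "prompt": "What is the capital of France?",
    "chosen": "What is the capital of France?\nThe capital of France is Paris.",
    "rejected": "What is the capital of France?\nI think it might be Lyon.",
}

# Multi-turn: one option is to put the whole history in "prompt"
# and only the latest assistant reply in "chosen"/"rejected".
row_multi_turn = {
    "prompt": "Human: Can you suggest a book?\nAssistant: Sure, what genre?\nHuman: Science fiction.\nAssistant:",
    "chosen": " I'd recommend 'Dune' by Frank Herbert.",
    "rejected": " I don't read books.",
}
```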

I also wonder how to prepare multi-turn conversation data such as Anthropic/hh-rlhf for DPO (see the sketch at the end of this post),

and

should we add “chosen_rating” and “rejected_rating” columns to the dataset?
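To make the hh-rlhf question concrete, this is roughly how I imagine splitting each row into prompt/chosen/rejected. I haven't confirmed this is the intended preprocessing; splitting at the last "\n\nAssistant:" marker is my assumption:

```python
from datasets import load_dataset

def split_prompt_and_response(example):
    """Split hh-rlhf style transcripts into prompt / chosen / rejected.

    Assumption: 'chosen' and 'rejected' share the same conversation history
    and differ only in the final assistant turn.
    """
    marker = "\n\nAssistant:"
    # Everything up to and including the last "Assistant:" marker becomes the prompt.
    idx_c = example["chosen"].rfind(marker)
    idx_r = example["rejected"].rfind(marker)
    return {
        "prompt": example["chosen"][: idx_c + len(marker)],
        "chosen": example["chosen"][idx_c + len(marker):],
        "rejected": example["rejected"][idx_r + len(marker):],
    }

dataset = load_dataset("Anthropic/hh-rlhf", split="train")
dataset = dataset.map(split_prompt_and_response)
```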

I’m not at all familiar with LLM training myself, so I can’t help you directly, but this person trains RP models, reads a lot of papers, and shares what he learns, so I think you could get good information by sending him a mention (@+grimjim) or opening a new Discussion on one of his repos.


Hi, thanks, gonna ask him…


His activity is irregular, so be patient. Also, there are probably quite a few individuals and researchers on HF who are familiar with DPO whom I don’t know, but I don’t think many of them frequent the forum.
It would be more reliable to find DPO-related repos and articles, follow them, and contact the authors directly.
This is a current shortcoming of HF’s community features, and it has been pointed out in various ways in the recent, still-ongoing request for feedback.


Thanks a lot for your help. I did some research on DPO-related repos and articles and asked yesterday; if I get an answer, I’ll share it with you.

I also wonder how to prepare multi-turn conversation data such as Anthropic/hh-rlhf for DPO


I did some research on DPO-related repos and articles and asked yesterday

Oh, you had already acted.

such as Anthropic/hh-rlhf for DPO

I think the best way to be sure is to ask Anthropic directly, but if they seem unresponsive, you could try to find another training-data author who seems more approachable.
I’m basically just a guy playing with image-generation AI and Python, and I only tinker a bit with LLMs for tag generation, so I’ll just watch LLM training from afar and try the results when they’re finished.
In other words, all I can really do is point you in a direction.

I was confused because some people added the prompt to ‘chosen’ and ‘rejected’ in their notebooks.

I think the best approach is separating ‘prompt’, ‘chosen’, and ‘rejected’. According to the TRL GitHub repo (1), the DPOTrainer source, and the HF docs (2), the prompt is already prepended during training, so I’ll keep the columns separate to avoid repetition. A minimal sketch is below the references.

1- trl/trl/trainer/dpo_trainer.py at main · huggingface/trl · GitHub
2- https://huggingface.co/docs/trl/main/en/dpo_trainer
3- Fine-tune a Mistral-7b model with Direct Preference Optimization | by Maxime Labonne | Towards Data Science
4- How to fine-tune Llama3 using DPO on Brev 🤙, etc.
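Here is a minimal sketch of that separated format and roughly how it would be passed to DPOTrainer. The example texts are invented, and the exact trainer keyword arguments depend on your TRL version, so treat the trainer part as a sketch rather than a verified recipe:

```python
from datasets import Dataset

# Each row keeps the prompt separate; "chosen"/"rejected" hold only the
# assistant completions. DPOTrainer concatenates prompt + completion itself.
train_dataset = Dataset.from_dict({
    "prompt": [
        "What color is the sky?",
        "Human: How do I boil an egg?\n\nAssistant:",
    ],
    "chosen": [
        "The sky appears blue during the day.",
        " Place the egg in boiling water for about 8-10 minutes.",
    ],
    "rejected": [
        "The sky is green.",
        " I have no idea.",
    ],
})

# Rough trainer setup -- the exact keyword names vary between TRL versions
# (e.g. older releases take `beta` directly, newer ones via DPOConfig),
# so check the docs for the version you have installed.
# from trl import DPOTrainer, DPOConfig
# trainer = DPOTrainer(
#     model=model,                 # your policy model
#     ref_model=None,              # TRL can create a frozen reference copy
#     args=DPOConfig(output_dir="dpo-out", beta=0.1),
#     train_dataset=train_dataset,
#     tokenizer=tokenizer,         # or processing_class= in newer TRL
# )
# trainer.train()
```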

There is no reliable introductory procedure for generative-AI technologies in general, and even when one exists, the pace of change makes it obsolete within a few months.

That may be partly why so many people on HF are basically figuring things out on their own.
Some must be following outdated know-how without realizing it; this case may be one of them.

There are also probably many raw datasets that were uploaded on the assumption that they would be processed by scripts before use.
