DPO with Chat Data


I am curious about training an LLM with DPO on a chat dataset containing messages between a user and an assistant. I want to build a DPO dataset with ‘prompt’, ‘chosen’, and ‘rejected’ fields, where each ‘chosen’ entry is the assistant’s actual response and each ‘rejected’ entry is generated by an SFT model I trained. However, I’m having difficulty constructing this dataset. Should each assistant turn in a chat be treated as a separate data sample, with the entire preceding chat history in its prompt? Or is there a more efficient way to format these prompts for DPO? Any guidance would be greatly appreciated!
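For reference, here is a minimal sketch of the per-turn splitting I have in mind, in Python. `generate_rejected` is a hypothetical placeholder for sampling from my SFT model, and the field layout (lists of `{"role", "content"}` dicts) is just one plausible convention:

```python
def generate_rejected(history):
    # Hypothetical placeholder: in practice, this would run the SFT model
    # on the chat history and return its sampled response.
    return "<SFT model response>"

def chat_to_dpo_samples(messages):
    """Split one chat into DPO samples: one sample per assistant turn,
    with all preceding messages as the prompt."""
    samples = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant":
            continue
        history = messages[:i]  # everything before this assistant turn
        samples.append({
            "prompt": history,
            "chosen": [msg],  # the dataset's real assistant reply
            "rejected": [{"role": "assistant",
                          "content": generate_rejected(history)}],
        })
    return samples

chat = [
    {"role": "user", "content": "Hi, what's DPO?"},
    {"role": "assistant", "content": "Direct Preference Optimization is..."},
    {"role": "user", "content": "How do I build the dataset?"},
    {"role": "assistant", "content": "One common approach is..."},
]
samples = chat_to_dpo_samples(chat)
print(len(samples))  # one sample per assistant turn
```

With this scheme, a chat with N assistant turns yields N preference pairs, and later samples carry progressively longer prompts, which is the redundancy the question above is asking whether to avoid.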