What I mean is that I want to run SFT on the model for a few steps, and then use that same model for DPO.
Alternatively, I could use a data collator to build a batch of about 15 samples and run SFT on it. Once SFT on all 15 samples is done, I would build another batch of about 5 samples and run DPO on it, repeating this process until I have gone through the whole dataset.
Does anyone have any ideas that they can suggest to me?
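To make the schedule concrete, here is a rough sketch of the alternation described above. It only produces the order of the chunks and does not train anything. The function name `alternate_sft_dpo` and its parameters are hypothetical, not part of any library, and the chunk sizes 15 and 5 are just the numbers from the question.

```python
from typing import Iterator, List, Sequence, Tuple


def alternate_sft_dpo(
    sft_data: Sequence,
    dpo_data: Sequence,
    sft_chunk: int = 15,
    dpo_chunk: int = 5,
) -> Iterator[Tuple[str, List]]:
    """Yield ("sft", chunk) and ("dpo", chunk) in turn until both datasets are used up.

    Hypothetical helper: it only decides the order of the chunks.
    """
    i = j = 0
    while i < len(sft_data) or j < len(dpo_data):
        if i < len(sft_data):
            yield "sft", list(sft_data[i:i + sft_chunk])
            i += sft_chunk
        if j < len(dpo_data):
            yield "dpo", list(dpo_data[j:j + dpo_chunk])
            j += dpo_chunk


if __name__ == "__main__":
    # Placeholder data standing in for real SFT examples and preference pairs.
    sft_examples = [f"sft_{k}" for k in range(30)]
    dpo_pairs = [f"pair_{k}" for k in range(10)]
    for phase, chunk in alternate_sft_dpo(sft_examples, dpo_pairs):
        # A real loop would run an SFT or DPO update on `chunk` here.
        print(phase, len(chunk))
```

In a real run, each yielded chunk would be passed to whichever training step matches its phase, with both steps updating the same model weights.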
I wonder why you are asking this. SFT doesn't use a reference model, so you load just the one full-weight model and fine-tune it. DPO, however, also loads a copy of the model and freezes it as the reference.
Different structure, different loss, different codebase, so you should not do it.
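As a minimal illustration of the "different loss" point, the sketch below computes both objectives from plain log-probabilities. The function names are made up for this example, and `beta=0.1` is an assumed illustrative value, not a recommendation. Note that the DPO loss needs log-probabilities from the frozen reference model, while the SFT loss only needs the model being trained.

```python
import math


def sft_loss(target_logprobs):
    """SFT: mean negative log-likelihood of the target tokens under the trained model."""
    return -sum(target_logprobs) / len(target_logprobs)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probs.

    policy_* come from the model being trained, ref_* from the frozen reference.
    beta is an assumed illustrative value.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))


if __name__ == "__main__":
    # Made-up log-probabilities, purely for illustration.
    print(sft_loss([-1.0, -3.0]))
    print(dpo_loss(-1.0, -2.5, -1.2, -2.0))
```

When the policy and the reference agree exactly, the margin is zero and the DPO loss is log 2 regardless of the SFT loss, which shows that the two objectives measure different things.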