@A-Alsad For the Falcon fine-tune code, I traced the problem back to the _prepare_dataset() function in the trl SFTTrainer (trl/trainer/sft_trainer.py). It changes the dataset column names from "text" to "input_ids" for the openassistant dataset. With my custom dataset it didn't do this, so my "text" column got removed (as described above); the dataset was therefore empty, which produced the error. I have to investigate further, but maybe this helps.
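You can see the renaming by comparing column names before and after the trainer's preprocessing. A minimal sketch, assuming a datasets-style Dataset; the dataset contents are placeholders:

```python
from datasets import Dataset

# Placeholder dataset standing in for my custom data (illustrative only).
dataset = Dataset.from_dict({"text": ["some training example", "another example"]})
print(dataset.column_names)  # ['text']

# After constructing the SFTTrainer, the prepared dataset should expose
# 'input_ids' instead of 'text'. If preprocessing produced no samples,
# the original column is still removed and the dataset ends up empty.
# trainer = SFTTrainer(model=model, train_dataset=dataset, dataset_text_field="text", ...)
# print(trainer.train_dataset.column_names)  # expected: ['input_ids']
```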
Edit1: I figured out the problem with my dataset. For me it had to do with the tokenizing in the _prepare_non_packed_dataloader() function in the SFTTrainer. If max_seq_len is too high (I guess relative to the specific dataset), it can produce an empty input_batch. It warns about it but continues anyway and removes the original column. That's basically the root of the error described above. Lowering max_seq_len did the trick for me (sketch below). I don't know enough about the tokenization process, so maybe someone else could explain.
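For reference, here is a minimal sketch of the workaround, assuming the SFTTrainer API from the trl version current when this was written; the model name, dataset contents, and the exact max_seq_length value are placeholders (pick a value that fits your data):

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer

model_name = "tiiuae/falcon-7b"  # assumption: the Falcon model discussed in this thread
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Placeholder dataset with a "text" column, standing in for my custom data.
dataset = Dataset.from_dict({"text": ["example one", "example two"]})

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,  # lowering this value made the empty input_batch go away for me
    tokenizer=tokenizer,
)
trainer.train()
```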
Edit2: The problem seems to be that the trl SFT trainer filters out samples that are shorter than max_seq_len after tokenization. See the github issue posted below. So my fix above does not really solve the issue: lowering max_seq_len just means more samples reach that length and survive the filter, while genuinely short samples are still dropped.
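For anyone curious, this is roughly the filtering logic as I understand it (a paraphrase of the behavior described above, not the exact trl source; the tokenizer and max_seq_len values are placeholders):

```python
import warnings
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")  # placeholder tokenizer
max_seq_len = 1024  # placeholder value

def tokenize(element):
    # Tokenize with truncation/overflow so long texts are split into chunks
    # of at most max_seq_len tokens.
    outputs = tokenizer(
        element["text"],
        truncation=True,
        max_length=max_seq_len,
        return_overflowing_tokens=True,
        return_length=True,
    )
    input_batch = []
    for length, input_ids in zip(outputs["length"], outputs["input_ids"]):
        # Only chunks of exactly max_seq_len survive; samples that tokenize
        # to fewer tokens are filtered out here.
        if length == max_seq_len:
            input_batch.append(input_ids)
    if len(input_batch) == 0:
        # It warns here but continues anyway, which is why the "text" column
        # gets removed and the dataset ends up empty.
        warnings.warn("Found no samples of length max_seq_len in this batch.")
    return {"input_ids": input_batch}
```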