Hi,
I’m using the SFT trainer to fine-tune a code model. Since truncated code doesn’t make a lot of sense, I filter samples before training so that they fit within max_seq_length.
What I currently do is 1) tokenize, 2) filter by length, 3) pass the untokenized samples to the trainer. This is inefficient, as the trainer then tokenizes everything again…
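To make the redundancy concrete, here is a minimal sketch of that flow. A toy whitespace tokenizer stands in for the real model tokenizer, and `max_seq_length` and the samples are placeholder values:

```python
# Sketch of the current tokenize -> filter -> (re)tokenize flow.
# toy_tokenize is a stand-in for the real model tokenizer.

def toy_tokenize(text):
    return text.split()

max_seq_length = 8  # placeholder value

samples = [
    {"text": "def add(a, b): return a + b"},          # short, kept
    {"text": "def long_function(): " + "x = 1; " * 20},  # too long, dropped
]

# Steps 1+2: tokenize once just to measure length, then filter.
kept = [s for s in samples if len(toy_tokenize(s["text"])) <= max_seq_length]

# Step 3: the *untokenized* text in `kept` is what goes to the trainer,
# which then tokenizes it a second time -- the wasted work described above.
print(len(kept))
```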
First, I wanted to ask: is there a way to automatically filter samples by length? This seems like it would be a useful feature.
Second, could someone please point me to an example of how to pass pre-tokenized samples to the SFTTrainer directly? I tried, but the SFTTrainer complained about missing labels.
Thanks for your help!