Hi,
I’m using the SFT trainer to fine-tune a code model. Since truncated code doesn’t make a lot of sense, I filter samples before training so that they fit within max_seq_length.
What I currently do is 1) tokenize, 2) filter by length, 3) pass the untokenized samples to the trainer. This is inefficient, as the trainer then tokenizes everything again…
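To make the redundancy concrete, here is a minimal sketch of that flow. A toy whitespace tokenizer stands in for the real model tokenizer, and `max_seq_length` and the samples are placeholder values:

```python
# Sketch of the current tokenize -> filter -> (re)tokenize flow.
# toy_tokenize is a stand-in for the real model tokenizer.

def toy_tokenize(text):
    return text.split()

max_seq_length = 8  # placeholder value

samples = [
    {"text": "def add(a, b): return a + b"},          # short, kept
    {"text": "def long_function(): " + "x = 1; " * 20},  # too long, dropped
]

# Steps 1+2: tokenize once just to measure length, then filter.
kept = [s for s in samples if len(toy_tokenize(s["text"])) <= max_seq_length]

# Step 3: the *untokenized* text in `kept` is what goes to the trainer,
# which then tokenizes it a second time -- the wasted work described above.
print(len(kept))
```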
First, I wanted to ask: is there a way to automatically filter samples by length? This seems like it would be a useful feature.
Second, could someone please point me to an example of how to pass pre-tokenized samples to the SFTTrainer directly? I tried, but the SFTTrainer complained about missing labels.
Thanks for your help!