When to use a DataCollator for SFTTrainer

Hello,

In the SFTTrainer documentation, it is stated that if the dataset is already in the right format, we don't need to specify a DataCollator with a response_template.

However, after I formatted my dataset with tokenizer.apply_chat_template from TinyLlama/TinyLlama-1.1B-Chat-v1.0, the labels in the train dataloader do not look correct to me.
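For reference, this is roughly how I build the text column (a minimal sketch only; the "question" and "answer" column names are placeholders for my actual dataset):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

def format_example(example):
    # Turn one row into TinyLlama's chat format using the tokenizer's own template.
    messages = [
        {"role": "user", "content": example["question"]},
        {"role": "assistant", "content": example["answer"]},
    ]
    example["text"] = tokenizer.apply_chat_template(messages, tokenize=False)
    return example

dataset = dataset.map(format_example)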

Here is a sample from the SFTTrainer's train dataloader; the snippet below shows roughly how I pulled it out:
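(This is just how I inspected it, assuming trainer is the SFTTrainer instance I already constructed.)

# Grab one batch from the trainer's train dataloader and print the first sample.
dataloader = trainer.get_train_dataloader()
batch = next(iter(dataloader))

print(tokenizer.decode(batch["input_ids"][0]))  # decoded input
print(batch["input_ids"][0])                    # input_ids
print(batch["labels"][0])                       # labels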

(I removed the <s> and </s> special tokens from the decoded text for readability.)
Input : <|user|>Which is bigger, the moon or the sun?'<|assistant|> The sun.
input_ids : tensor([ 1, 529, 29989, 1792, 29989, 29958, 13, 8809, 436, 338,
16600, 29892, 278, 18786, 470, 278, 6575, 29973, 2, 29871,
13, 29966, 29989, 465, 22137, 29989, 29958, 13, 1576, 6575,
29889, 2, 29871, 13, 2, 2, 2, 2])

labels : tensor([ 1, 529, 29989, 1792, 29989, 29958, 13, 8809, 436, 338,
16600, 29892, 278, 18786, 470, 278, 6575, 29973, -100, 29871,
13, 29966, 29989, 465, 22137, 29989, 29958, 13, 1576, 6575,
29889, -100, 29871, 13, -100, -100, -100, -100])

Are the labels supposed to look like that, or is this incorrect? As far as I can tell, only the </s> / padding positions are set to -100, while the user prompt tokens are not masked at all.
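For context, this is the kind of collator setup I thought the documentation was referring to. It is only a sketch: whether dataset_text_field is passed to SFTTrainer directly or through SFTConfig depends on the trl version, and model is assumed to be the loaded TinyLlama model.

from trl import SFTTrainer, DataCollatorForCompletionOnlyLM

# Mask everything before the assistant response so the loss is only
# computed on the completion tokens.
response_template = "<|assistant|>"
collator = DataCollatorForCompletionOnlyLM(response_template, tokenizer=tokenizer)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    data_collator=collator,
    tokenizer=tokenizer,
)

Should I be using something like this even though my dataset is already formatted with the chat template?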