Data Preparation for CausalLM

blueeagle · March 15, 2023, 1:05pm

Hi all,

I want to train a CausalLM (gpt2) according to this course.
For this, I am using DataCollatorForLanguageModeling with the flag mlm set to False.
However, I am still unsure how exactly the training examples are generated from one sample.
Given a tokenized sample

[10, 14, 36, 28, 30, 31, 77, 100, 101]

the data collator returns the following input and label for training:

input = [10, 14, 36, 28, 30, 31, 77, 100, 101]
label = [10, 14, 36, 28, 30, 31, 77, 100, 101]

In the documentation of the data collator I already found that the labels are shifted by one position automatically inside the model during training. Still, for causal language modeling I would want to create multiple inputs and labels from the given sample, so that the model has to predict the correct token at each position, hence:

input = [
	[10,  0,  0,  0,  0,  0,  0,   0,   0]
	[10, 14,  0,  0,  0,  0,  0,   0,   0]
	[10, 14, 36,  0,  0,  0,  0,   0,   0]
	[10, 14, 36, 28,  0,  0,  0,   0,   0]
	[10, 14, 36, 28, 30,  0,  0,   0,   0]
	[10, 14, 36, 28, 30, 31,  0,   0,   0]
	[10, 14, 36, 28, 30, 31, 77,   0,   0]
	[10, 14, 36, 28, 30, 31, 77, 100,   0]
]
label = [
	[10, 14,  0,  0,  0,  0,  0,   0,   0]
	[10, 14, 36,  0,  0,  0,  0,   0,   0]
	[10, 14, 36, 28,  0,  0,  0,   0,   0]
	[10, 14, 36, 28, 30,  0,  0,   0,   0]
	[10, 14, 36, 28, 30, 31,  0,   0,   0]
	[10, 14, 36, 28, 30, 31, 77,   0,   0]
	[10, 14, 36, 28, 30, 31, 77, 100,   0]
	[10, 14, 36, 28, 30, 31, 77, 100, 101]
]

My question now is:
Is this done automatically by the "CausalLM" model, or do I have to implement it myself in a custom dataloader/dataset?

blueeagle · March 16, 2023, 7:46am

I just figured it out myself.
As explained in this video, ensuring that the model predicts the correct next token at each position is usually done by using a triangular mask in the self-attention layer, not by passing every prefix as a separate sample.
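To make the triangular mask concrete, here is a minimal pure-Python sketch. It is only an illustration: the helper name causal_mask is made up for this post, and real implementations build the mask as a tensor and apply it to the attention scores rather than to token lists.

```python
def causal_mask(seq_len):
    # Lower-triangular mask: position `row` may attend to position `col`
    # only if col <= row, i.e. to itself and to earlier positions.
    return [[col <= row for col in range(seq_len)] for row in range(seq_len)]


tokens = [10, 14, 36, 28, 30, 31, 77, 100, 101]
mask = causal_mask(len(tokens))

# For each position, list the tokens that stay visible under the mask.
visible = [[tok for tok, keep in zip(tokens, row) if keep] for row in mask]

for position, context in enumerate(visible):
    print(position, context)
```

Each row of visible is one of the prefixes from the question, so a single forward pass over the full sequence already covers all of them at once.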
Looking at the GPT2 implementation from Hugging Face, I found that the GPT2Attention module implements a triangular causal_mask for this, so there should be no need to preprocess the data manually as asked above.
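As a rough sketch of why input and label can be identical, the snippet below mimics the one-position shift that happens when the loss is computed, using plain lists instead of tensors. The variable names are illustrative and not taken from the library. It assumes the usual behavior: the prediction at the last position is dropped because nothing follows it, and the first label is dropped because nothing precedes it.

```python
tokens = [10, 14, 36, 28, 30, 31, 77, 100, 101]

# With mlm=False the collator hands the same ids over as input and label.
input_ids = list(tokens)
labels = list(tokens)

# One-position shift: the prediction made at position i is scored
# against the token at position i + 1.
targets = labels[1:]

# Under the causal mask, position i only sees input_ids[: i + 1].
pairs = [(input_ids[: i + 1], targets[i]) for i in range(len(targets))]

for context, target in pairs:
    print(context, "->", target)
```

The eight (context, target) pairs match the eight rows of the expanded input and label lists in the question, with the zero padding left out.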
