Hi,
I am trying to fine-tune the distilgpt2 model on the MNLI dataset to turn it into a classifier by generating the right label (contradiction, entailment, or neutral). I followed the example given in the Transformers documentation.
I am stepping through the Trainer in a debugger because I would like to see how the "input_ids"/"labels" are shifted by one token by the DataCollator, as described in the documentation linked above, which states the following:
Now create a batch of examples using DataCollatorForLanguageModeling. It’s more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.
Use the end-of-sequence token as the padding token and set `mlm=False`. This will use the inputs as labels shifted to the right by one element:

```python
>>> from transformers import DataCollatorForLanguageModeling
>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
```
My code looks as follows:
from transformers import AutoTokenizer

model_checkpoint = "distilgpt2"
batch_size = 24
sequence_length = block_size = 512

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
tokenizer.pad_token = tokenizer.eos_token
def form_training_prompts(example):
    hypothesis = example["hypothesis"]
    premise = example["premise"]
    class_label = ["entailment", "neutral", "contradiction"][example["label"]]
    example["text"] = (
        f"mnli hypothesis: {hypothesis} premise: {premise} target: {class_label}<|endoftext|>"
    )
    return example
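For reference, a single formatted example ends up looking like this (the hypothesis/premise below are made-up values, only for illustration):

```python
# Made-up MNLI-style example, just to show the prompt format produced above
example = {
    "hypothesis": "A man is sleeping.",
    "premise": "A man is playing a guitar on stage.",
    "label": 2,  # index 2 -> "contradiction" in the list above
}
print(form_training_prompts(example)["text"])
# mnli hypothesis: A man is sleeping. premise: A man is playing a guitar on stage. target: contradiction<|endoftext|>
```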
def tokenizes_text(dataset):
    tokenized = tokenizer(dataset["text"], return_tensors="np")
    return tokenized
def group_texts(examples, block_size):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder; we could pad instead if the model supported it.
    # You can customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
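To sanity-check the packing, here is a minimal toy run of group_texts (the token IDs and the block_size of 4 are arbitrary, only for illustration):

```python
# Toy batch of two tokenized "examples", concatenated and re-split into blocks of 4
toy_batch = {
    "input_ids": [[1, 2, 3, 4, 5], [6, 7, 8, 9]],
    "attention_mask": [[1, 1, 1, 1, 1], [1, 1, 1, 1]],
}
packed = group_texts(toy_batch, block_size=4)
print(packed["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]] -- the trailing 9 is dropped
print(packed["labels"])     # same values as input_ids at this point (a plain copy)
```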
# Create prompts
dataset = dataset.map(
    form_training_prompts,
    remove_columns=["hypothesis", "premise", "label", "idx"],
    load_from_cache_file=False,
    desc="Generating text prompt",
)
# Tokenize prompts
dataset = dataset.map(
    tokenizes_text,
    batched=True,
    batch_size=batch_size,
    num_proc=1,
    remove_columns=dataset.column_names,
    load_from_cache_file=False,
    desc="Tokenizing text",
)
# Pack tokenized sequences into fixed-size blocks
dataset = dataset.map(
    group_texts,
    batched=True,
    batch_size=batch_size,
    num_proc=1,
    load_from_cache_file=False,
    desc="Packing sequences",
    fn_kwargs={"block_size": block_size},
)
The validation set (eval_dataset) is created with the same process. The code continues as below:
from transformers import AutoModelForCausalLM, DataCollatorForLanguageModeling, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(model_checkpoint)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
training_args = TrainingArguments(
    output_dir="model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    weight_decay=0.01,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)
trainer.train()
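To look at what actually reaches the model without stepping through the Trainer, the collator can also be called directly on a couple of packed examples (a minimal sketch, reusing the dataset and data_collator variables from above):

```python
# Build one batch by hand from the first two packed blocks and inspect it
features = [dataset[i] for i in range(2)]
batch = data_collator(features)
print(batch["input_ids"][0][:10])
print(batch["labels"][0][:10])  # compare position by position with the line above
```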
As I follow the code in the debugger, a print of the "inputs" variable in the training loop (at line 1872 of the Trainer source code) looks like this:
{'input_ids': tensor([
[**13**, 383, 3721, ..., 13591, 11, **2592**],
[**13262**, 515, 4560, ..., 9022, 6890, **1597**],
[ **3950**, 3037, 1998, ..., 22908, 11, **22908**],
...,
[ **4038**, 13, 6601, ..., 511, 31156, **4788**],
[ **355**, 857, 3220, ..., 12678, 286, **3113**],
[ **338**, 1410, 13, ..., 642, 38, **2831**]]),
'attention_mask': tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
...,
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 1, 1, 1]]),
'labels': tensor([
[**13**, 383, 3721, ..., 13591, 11, **2592**],
[**13262**, 515, 4560, ..., 9022, 6890, **1597**],
[ **3950**, 3037, 1998, ..., 22908, 11, **22908**],
...,
[ **4038**, 13, 6601, ..., 511, 31156, **4788**],
[ **355**, 857, 3220, ..., 12678, 286, **3113**],
[ **338**, 1410, 13, ..., 642, 38, **2831**]])}
I have highlighted in double asterisks the first and last token IDs in the input_ids tensor and the labels tensor, and they are the same.
- Going back to my question: where does the token shifting take place? Am I misunderstanding how this works?
- After grouping the data into the same batch (the group_texts function), how does the model distinguish between the different examples and not assume it's all one? Wouldn't adding an eos token at the end of each prompt example help?
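For context, one way to check whether the example boundaries survive the packing step is to decode a packed block (a small sketch, assuming the dataset and tokenizer variables from above):

```python
# Decode one packed block to see where the <|endoftext|> markers ended up
text = tokenizer.decode(dataset[0]["input_ids"])
print(text.count("<|endoftext|>"), "example boundaries in this block")
print(text[:300])
```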