Does order of training data matter when fine-tuning a BERT or RoBERTa model?

samr · August 31, 2022, 10:25am

Let’s say I am training a model by fine-tuning a transformers model. Let’s stick with the example in the HuggingFace course:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

We then load some data, tokenize, define some training arguments and train:

from transformers import Trainer

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)

I understand from this post that:

Fine-tuning is the act of re-training a pretrained model on a new dataset/task, it has nothing to do with freezing part of the network

I understand this to mean that when we train the model, the attention weights can change over time based on the input. So if we used the example input sentence:

The animal didn’t cross the street because it was too tired

We might initially get attention heads for the word “it” that look like:

But if we imagine that:

There are millions of sentences in the fine-tuning data.
Each of these sentences is very different to the original training data.
The sentences differ in such a way that the attention heads for the word “it” end up giving a high weight to the word “because” when it appears next to “it” (unlike above, where its weight is very low).

Would we then get a situation where, for a sentence like the one above that appears towards the end of the training data, the attention heads give more weight to “because” than they would for a similar sentence seen near the start?
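To make the scenario concrete, here is a minimal sketch of how attention weights are computed and how they can shift when the underlying vectors change. It uses plain scaled dot-product attention with a softmax. The 2-dimensional key and query vectors are hypothetical numbers invented for illustration; in a real BERT model they come from learned projection matrices applied to the hidden states, and it is those matrices that fine-tuning updates.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    # Scaled dot-product attention: softmax(q . k / sqrt(d)) over all keys.
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
        for key in keys
    ]
    return softmax(scores)

tokens = ["animal", "because", "it"]

# Hypothetical key vectors, one per token (not taken from any real model).
keys = [[2.0, 0.0], [0.0, 2.0], [0.5, 0.5]]

# Hypothetical query vectors for "it" before and after fine-tuning.
query_before = [2.0, 0.0]  # aligned with the key for "animal"
query_after = [0.0, 2.0]   # aligned with the key for "because"

before = attention_weights(query_before, keys)
after = attention_weights(query_after, keys)

for tok, b, a in zip(tokens, before, after):
    print(f"{tok:>8}: before={b:.2f} after={a:.2f}")
```

In this toy setup the weight on “because” goes from the smallest to the largest of the three purely because the query vector moved, which is the kind of drift the question is asking about.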

If so, is this why we train for many epochs? And do we generally know how well that solves the problem?
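The order effect itself can be reproduced in a toy setting that has nothing to do with BERT. The sketch below runs plain SGD on a single-weight model `y_hat = w * x`, using two hypothetical groups of examples that pull the weight in different directions. The data, learning rate, and number of epochs are all assumptions chosen only to make the effect visible.

```python
import random

def sgd(examples, w=0.0, lr=0.1):
    # One pass of plain SGD on squared error for the model y_hat = w * x.
    for x, y in examples:
        grad = 2 * (w * x - y) * x
        w -= lr * grad
    return w

# Hypothetical data: two groups that disagree about the best weight.
group_a = [(1.0, 1.0)] * 20  # consistent with w = 1
group_b = [(1.0, 3.0)] * 20  # consistent with w = 3

# Single pass, fixed order: the group seen last dominates the result.
w_a_then_b = sgd(group_a + group_b)
w_b_then_a = sgd(group_b + group_a)

# Several epochs, reshuffling the data before each one.
rng = random.Random(0)
data = group_a + group_b
w_shuffled = 0.0
for epoch in range(5):
    rng.shuffle(data)
    w_shuffled = sgd(data, w=w_shuffled)

print(f"A then B: {w_a_then_b:.2f}")
print(f"B then A: {w_b_then_a:.2f}")
print(f"shuffled, 5 epochs: {w_shuffled:.2f}")
```

With a fixed order and a single pass, the weight ends up close to whatever the last group wanted. Shuffling over several epochs interleaves the groups, so neither one systematically gets the last word. How far this carries over to a large pretrained model is exactly the open part of the question.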
