Let’s say I am training a model by fine-tuning a transformers model. Let’s stick with the example in the HuggingFace course:
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# num_labels=2 matches the course's MRPC pair-classification setup
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
We then load some data, tokenize, define some training arguments and train:
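(The original snippet doesn’t show this step; the sketch below fills it in following the course’s GLUE MRPC example, so the specific dataset is an assumption on my part.)

from datasets import load_dataset
from transformers import DataCollatorWithPadding

raw_datasets = load_dataset("glue", "mrpc")

def tokenize_function(example):
    # MRPC is a sentence-pair task, so both sentences are tokenized together
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

With tokenized_datasets and data_collator in hand, the Trainer is created and run: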
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments("test-trainer")

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
I understand from this post that:
Fine-tuning is the act of re-training a pretrained model on a new dataset/task; it has nothing to do with freezing part of the network.
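A quick way to see this is that every parameter, including the attention weights, is trainable by default (a minimal check, using the model loaded above):

# Count attention parameter tensors that will receive gradient updates
attn_params = [name for name, p in model.named_parameters()
               if "attention" in name and p.requires_grad]
print(len(attn_params), "trainable attention tensors, e.g.", attn_params[:2])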
I understand this to mean that as we train the model, the attention weights themselves can change over time based on the training inputs. So if we used the example input sentence:
The animal didn’t cross the street because it was too tired
We might initially get attention heads for the word “it” that look like the well-known visualization, where “it” attends mostly to “the animal” and gives “because” a very low weight.
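This is something we can inspect directly, by the way. The sketch below (mine, not from the post) prints the head-averaged last-layer attention that “it” pays to each token; running it before and after fine-tuning would show whether the pattern has shifted:

import torch

sentence = "The animal didn't cross the street because it was too tired"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
it_idx = tokens.index("it")
# outputs.attentions: one tensor per layer, each (batch, heads, seq_len, seq_len)
last_layer = outputs.attentions[-1][0]          # (heads, seq_len, seq_len)
from_it = last_layer[:, it_idx, :].mean(dim=0)  # attention from "it", averaged over heads
for token, weight in zip(tokens, from_it.tolist()):
    print(f"{token:>10}  {weight:.3f}")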
But if we imagine that:
- There are millions of sentences in the fine-tuning data.
- Each of these sentences is very different from the data the model was originally pre-trained on.
- The sentences differ in such a way that the attention heads for the word “it” end up giving the word “because” a high weight when it appears next to “it” (unlike above, where its weight is very low).
Would we end up in a situation where, for a similar sentence seen towards the end of training, the attention heads give more weight to “because” than they did above?
If so, is this why we train for many epochs? And do we generally know how well this solves the problem?