Does order of training data matter when fine-tuning a BERT or RoBERTa model?

Let’s say I am fine-tuning a pretrained transformers model. Let’s stick with the example in the HuggingFace course:

from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

We then load some data, tokenize, define some training arguments and train:

from transformers import Trainer

trainer = Trainer(

I understand from this post that:

Fine-tuning is the act of re-training a pretrained model on a new dataset/task, it has nothing to do with freezing part of the network

I understand this to mean that when we train the model, the attention weights can change over time based on the input. So if we used the example input sentence:

The animal didn’t cross the street because it was too tired

We might initially get attention heads for the word “it” that look like:

But if we imagine that:

  1. There are millions of sentences in the fine-tuning data.
  2. Each of these sentences is very different to the original training data.
  3. The sentences differ in such a way that the attention heads for the word “it” end up giving a high weight to the word “because” when it is next to “it” (unlike above, where it has a very low one).

Would we get a situation where, for two similar sentences, the attention heads give more weight to “because” for the one that appears towards the end of the training data than for the one that appears near the start, simply because the weights have drifted by then?

If so, is this why we train for many epochs? And do we generally know how well that solves the problem?
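To make the concern concrete, here is a toy sketch of the order effect I have in mind. It is plain Python, not BERT: a single weight `w` is fitted to `y = w * x` with per-example SGD, and the same examples are presented in different orders. The function name `sgd_final_weight`, the two made-up data groups, and the learning rate are all assumptions chosen for illustration only.

```python
import random


def sgd_final_weight(data, lr=0.1, epochs=1, shuffle_seed=None):
    """Fit y = w * x with per-example SGD and return the final weight.

    If shuffle_seed is given, the data is reshuffled at the start of
    every epoch; otherwise it is always visited in the given order.
    """
    w = 0.0
    order = list(data)
    rng = random.Random(shuffle_seed)
    for _ in range(epochs):
        if shuffle_seed is not None:
            rng.shuffle(order)
        for x, y in order:
            grad = 2 * (w * x - y) * x  # gradient of the squared error (w*x - y)^2
            w -= lr * grad
    return w


# Hypothetical data: group A is best fitted by w = 1, group B by w = 3.
# The least-squares fit over both groups together is w = 2.
group_a = [(1.0, 1.0)] * 20
group_b = [(1.0, 3.0)] * 20

# Same examples, one pass, opposite orders.
w_a_then_b = sgd_final_weight(group_a + group_b)  # ends close to 3
w_b_then_a = sgd_final_weight(group_b + group_a)  # ends close to 1

# Same examples, reshuffled every epoch for 20 epochs.
w_shuffled = sgd_final_weight(group_a + group_b, epochs=20, shuffle_seed=0)

print(w_a_then_b, w_b_then_a, w_shuffled)
```

In this toy case a single pass over sorted data ends up dominated by whichever group came last, while the shuffled run stays between the two extremes. What I do not know is how far this carries over to fine-tuning a model like BERT, where batches are normally shuffled and the learning rate is much smaller.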