Text classifier trains incorrectly with BERT transformers (F1 = 0) on the full dataset

Hello!
I have a dataset of 151008 sentences with only 2 classes (labels).
I wrote a sentence classifier using AutoModelForSequenceClassification, following the Hugging Face Course, and got the following results:
cointegrated/rubert-tiny2 - F1=0.9708
DeepPavlov/rubert-base-cased - F1=0.967
DeepPavlov/rubert-base-cased-conversational - F1=0.9283

These are the results I expected.
However, when I use other models on the same dataset (151008 sentences), I get the following results:
sberbank-ai/sbert_large_nlu_ru - F1=0.0
bert-base-multilingual-cased - F1=0.0

However, if I use ¼ of the dataset (37752 sentences), I get adequate results. I tried both the Trainer implementation and a custom training loop.
Please tell me what I'm doing wrong and how I can train these models on the full dataset.
I run the training in the cloud (Yandex Cloud), in a JupyterLab environment, on 1x V100.
Code:

from datasets import DatasetDict
from transformers import AutoTokenizer

# Load the dataset previously saved to disk
path = '/home/jupyter/work/resources/Datasets/dataset_raw'
raw_datasets = DatasetDict.load_from_disk(path)

# Tokenize
checkpoint = "bert-base-multilingual-cased"  # also tried: "DeepPavlov/rubert-base-cased", "sberbank-ai/sbert_large_nlu_ru"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize_function(example):
    return tokenizer(example["sentence"], truncation=True, max_length=128)

tokenized_datasets_raw = raw_datasets.map(tokenize_function, batched=True)

# Prepare for training
tokenized_datasets = tokenized_datasets_raw.remove_columns(["sentence", "idx", "level_0"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")

# Use the prepared splits, not tokenized_datasets_raw
train_dataset = tokenized_datasets['train']
eval_dataset = tokenized_datasets['test']

# Create the Trainer
import numpy as np
import torch
from datasets import load_metric
from transformers import (
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

device = torch.device("cuda")
metric = load_metric("f1")
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2).to(device)
training_args = TrainingArguments(output_dir="/home/jupyter/work/resources/Trash", evaluation_strategy="epoch")

def compute_metrics(eval_pred):
    # Convert logits to class ids, then compute F1 for the positive class
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),  # pad each batch; tokenization above only truncates
    compute_metrics=compute_metrics,
)
trainer.train()

x = trainer.predict(test_dataset=eval_dataset)
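
A side note on reading the result: with the binary F1 metric, a score of exactly 0.0 means there is not a single true positive, which most often happens when the model predicts the same class for every sentence. The following is only a rough diagnostic sketch, not part of the original code; the helper names `binary_f1` and `describe_predictions` are made up for illustration, and class 1 is assumed to be the positive label.

```python
import numpy as np

def binary_f1(predictions, references, positive_label=1):
    """F1 for the positive class; returns 0.0 when there are no true positives."""
    predictions = np.asarray(predictions)
    references = np.asarray(references)
    tp = int(np.sum((predictions == positive_label) & (references == positive_label)))
    fp = int(np.sum((predictions == positive_label) & (references != positive_label)))
    fn = int(np.sum((predictions != positive_label) & (references == positive_label)))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def describe_predictions(logits, references, num_labels=2):
    """Count how often each class is predicted and compare with the references."""
    predictions = np.argmax(np.asarray(logits), axis=-1)
    references = np.asarray(references)
    return {
        "predicted_counts": np.bincount(predictions, minlength=num_labels).tolist(),
        "reference_counts": np.bincount(references, minlength=num_labels).tolist(),
        "f1": binary_f1(predictions, references),
    }

# Toy example: every row of logits favours class 0, so class 1 is never predicted
toy_logits = [[2.0, -1.0], [1.5, 0.3], [0.9, 0.1], [3.0, -2.0]]
toy_labels = [0, 1, 1, 0]
print(describe_predictions(toy_logits, toy_labels))
```

With the Trainer output above, the same check would be `describe_predictions(x.predictions, x.label_ids)`. If `predicted_counts` shows a single class, the model has collapsed to one label rather than the metric being computed incorrectly.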

I found the error: I set evaluation_strategy="steps" and the problem was solved.
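
For context on what this setting changes, here is a rough sketch of how often the metric is reported under each strategy. It does not explain why the F1 was 0, only the evaluation schedule. It assumes the Trainer defaults (per_device_train_batch_size=8, and eval_steps falling back to logging_steps=500 when unset), a single GPU without gradient accumulation, and, hypothetically, that all sentences are in the train split; the actual train/test split sizes were not given. The function names are made up for illustration.

```python
import math

def steps_per_epoch(num_train_examples, batch_size=8):
    """Optimizer steps in one epoch (single device, no gradient accumulation)."""
    return math.ceil(num_train_examples / batch_size)

def evaluations_per_epoch(num_train_examples, batch_size=8, eval_steps=500):
    """Evaluations per epoch with evaluation_strategy="steps"."""
    return steps_per_epoch(num_train_examples, batch_size) // eval_steps

# Dataset sizes from the question; treating them as train-split sizes is an assumption
for name, size in [("full", 151008), ("quarter", 37752)]:
    print(
        name,
        "steps per epoch:", steps_per_epoch(size),
        "| evaluations per epoch with 'steps':", evaluations_per_epoch(size),
        "| with 'epoch': 1",
    )
```

Under these assumptions, evaluation_strategy="epoch" gives one F1 value only after a full pass over the data, while "steps" gives a value every 500 steps, which makes it easier to see when the score changes during training.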

I am surprised this solved the issue. If you remember, would you mind explaining why that was an error, and how your solution solved the problem?