Worse performance when training with albert-large

I’m training an MLM on my own corpus of clinical charts. When training with albert-base-v1 I reach an eval loss of 2.6, but when I run albert-large-v2 the eval loss quickly plateaus at around 6.0. I tried different learning rates with not much difference. I expected albert-large to perform better; why is it worse than base?

Here are my Trainer settings for albert-base-v1:

from transformers import Trainer, TrainingArguments
train_batch_size = 32
training_args = TrainingArguments(
    output_dir = './',
    logging_dir = './', 
    report_to = "wandb",
    overwrite_output_dir = True,
    num_train_epochs = 20,
    warmup_steps = 500,  
    learning_rate = 5e-5,
    logging_steps = 200,
    per_device_train_batch_size = train_batch_size,
    per_device_eval_batch_size= 8,
    save_steps = len_rf // train_batch_size,  # save_steps must be an int, hence integer division
    gradient_accumulation_steps=1,
    save_total_limit = 2,
    load_best_model_at_end = True,
    evaluation_strategy = "steps",
    do_train = True,
    do_eval = True,
    eval_accumulation_steps = 8,
    adafactor = True
)

trainer = Trainer(
    model=ehr_model,
    #optimizers = optimizers,
    args=training_args,
    data_collator = data_collator,
    train_dataset = train_dataset,
    eval_dataset = val_dataset
)
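
ehr_model and data_collator come from earlier in the script, roughly along these lines (a simplified sketch; the exact tokenizer class, masking probability, and preprocessing are not shown above and may differ):

from transformers import (
    AlbertForMaskedLM,
    AlbertTokenizerFast,
    DataCollatorForLanguageModeling,
)

# Tokenizer plus the standard MLM collator with dynamic masking
tokenizer = AlbertTokenizerFast.from_pretrained("albert-base-v1")
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)

# Model initialised from the pretrained albert-base-v1 checkpoint
ehr_model = AlbertForMaskedLM.from_pretrained("albert-base-v1")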

For albert-large-v2, I only changed these settings:
from transformers import Trainer, TrainingArguments
train_batch_size = 8
training_args = TrainingArguments(
    learning_rate = 5e-5,
    gradient_accumulation_steps = 4
)  # all other arguments are the same as for the base run
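
and the model is swapped to the larger checkpoint in the same way (again a simplified sketch, assuming the tokenizer and collator are reloaded for the new checkpoint):

# Same MLM setup, just pointing at the larger checkpoint
ehr_model = AlbertForMaskedLM.from_pretrained("albert-large-v2")
tokenizer = AlbertTokenizerFast.from_pretrained("albert-large-v2")
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)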