I’m training an MLM (masked language model) on my own corpus of clinical charts. With albert-base-v1 I reach an eval loss of about 2.6, but with albert-large-v2 the eval loss quickly plateaus at around 6.0. I tried several learning rates without much difference. I expected albert-large to perform better, so why is it worse than base?
Here are my Trainer settings for albert-base-v1:
from transformers import Trainer, TrainingArguments

train_batch_size = 32

training_args = TrainingArguments(
    output_dir = './',
    logging_dir = './',
    report_to = "wandb",
    overwrite_output_dir = True,
    num_train_epochs = 20,
    warmup_steps = 500,
    learning_rate = 5e-5,
    logging_steps = 200,
    per_device_train_batch_size = train_batch_size,
    per_device_eval_batch_size = 8,
    save_steps = len_rf // train_batch_size,  # integer division: save_steps must be an int (roughly one save per epoch if len_rf is the training set size)
    gradient_accumulation_steps = 1,
    save_total_limit = 2,
    load_best_model_at_end = True,
    evaluation_strategy = "steps",
    do_train = True,
    do_eval = True,
    eval_accumulation_steps = 8,
    adafactor = True,
)
trainer = Trainer(
    model = ehr_model,
    # optimizers = optimizers,
    args = training_args,
    data_collator = data_collator,
    train_dataset = train_dataset,
    eval_dataset = val_dataset,
)
For albert-large-v2, I only changed the following (everything else stays the same):

from transformers import Trainer, TrainingArguments

train_batch_size = 8

training_args = TrainingArguments(
    learning_rate = 5e-5,
    gradient_accumulation_steps = 4,
    # ... all other arguments identical to the albert-base-v1 run
)
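The model for the large run is loaded the same way, just pointing at the other checkpoint (again a sketch); note that the effective batch size works out to 8 × 4 = 32, the same as the base run:

from transformers import AlbertTokenizerFast, AlbertForMaskedLM

tokenizer = AlbertTokenizerFast.from_pretrained("albert-large-v2")
ehr_model = AlbertForMaskedLM.from_pretrained("albert-large-v2")

# effective batch size: 8 per device * 4 accumulation steps = 32, matching the base run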