I’m training an MLM (masked language model) on my own corpus of clinical charts. When training with albert-base-v1, I reach an eval loss of 2.6, but when I run albert-large-v2 the eval loss quickly plateaus at around 6.0. I tried different learning rates with not much difference. I expected albert-large to perform better. Why is it worse than base?
Here are my trainer settings for albert-base-v1:
```python
from transformers import Trainer, TrainingArguments

train_batch_size = 32

training_args = TrainingArguments(
    output_dir='./',
    logging_dir='./',
    report_to="wandb",
    overwrite_output_dir=True,
    num_train_epochs=20,
    warmup_steps=500,
    learning_rate=5e-5,
    logging_steps=200,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=8,
    save_steps=len_rf // train_batch_size,  # integer division: save_steps must be an int
    gradient_accumulation_steps=1,
    save_total_limit=2,
    load_best_model_at_end=True,
    evaluation_strategy="steps",
    do_train=True,
    do_eval=True,
    eval_accumulation_steps=8,
    adafactor=True,
)

trainer = Trainer(
    model=ehr_model,
    # optimizers=optimizers,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)
```

For albert-large-v2, I only changed these settings:

```python
from transformers import Trainer, TrainingArguments

train_batch_size = 8

training_args = TrainingArguments(
    learning_rate=5e-5,
    gradient_accumulation_steps=4,
    # ... (all other arguments as in the albert-base-v1 run above)
)
```
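As a sanity check on the two setups: since gradient updates are applied every `gradient_accumulation_steps` batches, the effective batch size per optimizer step (per-device batch size × accumulation steps, assuming single-device training) works out the same in both runs, so that at least shouldn't explain the gap:

```python
# Effective batch size per optimizer step = per-device batch * accumulation steps
# (single-device training assumed; multiply further by device count otherwise).
base_effective = 32 * 1   # albert-base-v1: train_batch_size=32, no accumulation
large_effective = 8 * 4   # albert-large-v2: train_batch_size=8, accumulation=4

print(base_effective, large_effective)  # → 32 32
assert base_effective == large_effective
```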