MLM train loss is very different after version update

Hi,
I use the run_mlm.py script to pretrain a BERT model from scratch. I'm not sure it's the most up-to-date version of the script, since I've been using it for a couple of months.
With transformers 4.5.1 I used to get a train loss of ~1.9, but after updating to transformers 4.9.2 the train loss is ~4.5.
I'm training from scratch on my data file plus a trained tokenizer, and I ran the exact same command both times.

These are the training arguments on 4.5.1:

output_dir=test-mlm-wiki, overwrite_output_dir=True, do_train=True, do_eval=None, do_predict=False, evaluation_strategy=IntervalStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=32, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.1, warmup_steps=0, logging_dir=runs/Aug08_11-51-18, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=500, save_strategy=IntervalStrategy.STEPS, save_steps=5000, save_total_limit=1, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=test-mlm-wiki, disable_tqdm=True, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, length_column_name=length, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, _n_gpu=1, mp_parameters=

These are the training arguments on 4.9.2:

TrainingArguments(_n_gpu=1, adafactor=False, adam_beta1=0.9,
adam_beta2=0.999, adam_epsilon=1e-08, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_find_unused_parameters=None, debug=[], deepspeed=None, disable_tqdm=True, do_eval=False, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_steps=None, evaluation_strategy=IntervalStrategy.NO, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, gradient_accumulation_steps=1, greater_is_better=None, group_by_length=False, ignore_data_skip=False, label_names=None, label_smoothing_factor=0.0, learning_rate=5e-05, length_column_name=length, load_best_model_at_end=False, local_rank=-1, log_level=-1, log_level_replica=-1, log_on_each_node=True, logging_dir=test-mlm-wiki/runs/Aug27_17-12-33, logging_first_step=False, logging_steps=500, logging_strategy=IntervalStrategy.STEPS, lr_scheduler_type=SchedulerType.LINEAR, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=1.0, output_dir=test-mlm-wiki, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=32, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=test-mlm-wiki-sst5-100, push_to_hub_organization=None, push_to_hub_token=None, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=test-mlm-wik, save_on_each_node=False, save_steps=5000, save_strategy=IntervalStrategy.STEPS,
save_total_limit=1, seed=42, sharded_ddp=[], skip_memory_metrics=True, tpu_metrics_debug=False, tpu_num_cores=None, use_legacy_prediction_loop=False, warmup_ratio=0.1, warmup_steps=0, weight_decay=0.0)

These are the differences I've found:

4.5.1                                          4.9.2
debug=False                                    debug=[]
eval_steps=500                                 eval_steps=None
logging_dir=runs/Aug08_11-51-18_dogfish-01     logging_dir=test-mlm-wiki/runs/Aug27_17-12-33_dogfish-01
-                                              log_level=-1
-                                              log_level_replica=-1
-                                              log_on_each_node=True
-                                              push_to_hub=False
-                                              push_to_hub_model_id=test-mlm-wiki
-                                              push_to_hub_organization=None
-                                              push_to_hub_token=None
-                                              save_on_each_node=False
-                                              use_legacy_prediction_loop=False
-                                              resume_from_checkpoint=None
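
(For completeness, this is roughly how a "key=value, key=value, ..." dump like the ones above can be diffed automatically. parse_args_dump is just a throwaway helper I'm sketching here, and the two short dumps in the example are placeholders, not my full argument lists.)

def parse_args_dump(dump):
    # Naively split a "key=value, key=value, ..." dump into a dict
    pairs = [item.split("=", 1) for item in dump.split(", ") if "=" in item]
    return {key.strip(): value for key, value in pairs}

old_args = parse_args_dump("debug=False, eval_steps=500, logging_steps=500")
new_args = parse_args_dump("debug=[], eval_steps=None, logging_steps=500")
for key in sorted(old_args.keys() | new_args.keys()):
    if old_args.get(key) != new_args.get(key):
        print(key, old_args.get(key, "-"), new_args.get(key, "-"))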

Any ideas why this happens?

I wonder if something is going on with the tokenizer…
When running on the 4.5.1 version, I changed AutoTokenizer to BertTokenizer in the run_mlm script (perhaps it had a bug; it wouldn't recognize my tokenizer).
On 4.9.2 I get this warning when running the script:
"The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization.
The tokenizer class you load from this checkpoint is 'T5TokenizerFast'.
The class this function is called from is 'BertTokenizer'."
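
In case it helps, this is the kind of minimal check I can run to see which class is actually picked up (the folder path is the one where I save my tokenizer in the code below, and the sample sentence is arbitrary):

from transformers import AutoTokenizer, BertTokenizer

# Load the same saved tokenizer folder with AutoTokenizer and with BertTokenizer explicitly
auto_tok = AutoTokenizer.from_pretrained('/tokenizers/wikipedia_1e6/')
bert_tok = BertTokenizer.from_pretrained('/tokenizers/wikipedia_1e6/')
print(type(auto_tok).__name__, type(bert_tok).__name__)

# Compare the token IDs both classes produce for the same text
sample = "an arbitrary test sentence"
print(auto_tok(sample)['input_ids'])
print(bert_tok(sample)['input_ids'])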

This is the code I'm using to train the tokenizer:

from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizer

# Train a WordPiece tokenizer on the raw text file
tokenizer = BertWordPieceTokenizer()
tokenizer.train(files=['wikipedia_1e6.txt'], vocab_size=30000)
tokenizer.save_model('/tokenizers/vocab_folder')

# Reload the vocab with BertTokenizer and save it in transformers format
tokenizer = BertTokenizer.from_pretrained('/tokenizers/vocab_folder')
tokenizer.save_pretrained('/tokenizers/wikipedia_1e6/')

However, when I train the tokenizer with transformers 4.9.2 and use it in run_mlm, I still get a higher train loss than with 4.5.1, but no warning.
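
To rule the tokenizer out as the cause of the higher loss, this is the kind of sanity check I have in mind: load the tokenizer trained under 4.5.1 and the one trained under 4.9.2 side by side and compare their output (the two folder names here are hypothetical placeholders for wherever each version was saved):

from transformers import BertTokenizer

# Hypothetical folders: one tokenizer saved under transformers 4.5.1, one under 4.9.2
old_tok = BertTokenizer.from_pretrained('/tokenizers/wikipedia_1e6_451/')
new_tok = BertTokenizer.from_pretrained('/tokenizers/wikipedia_1e6_492/')

sample = "an arbitrary sentence from my corpus"
print(old_tok.tokenize(sample))
print(new_tok.tokenize(sample))

# Vocab sizes and special tokens should match if the two tokenizers are equivalent
print(len(old_tok), len(new_tok), old_tok.mask_token, new_tok.mask_token)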