Hi,
I use the run_mlm.py script to pretrain a BERT model from scratch. I'm not sure whether this is the most up-to-date version of the script, since I've been using it for a couple of months.
With transformers 4.5.1 I used to get a train loss of ~1.9, but after updating to transformers 4.9.2 the train loss is ~4.5.
I'm training from scratch on my own data file with a tokenizer I trained, and I ran the exact same command both times.
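For reference, the command looks roughly like this (the config, tokenizer, and data file paths below are placeholders rather than my exact ones; the rest mirrors the training arguments shown further down):

```bash
# Illustrative sketch of the run_mlm.py invocation; paths are placeholders.
# No --model_name_or_path is passed, so the model is initialized from scratch.
python run_mlm.py \
    --model_type bert \
    --config_name ./bert_config \
    --tokenizer_name ./my_tokenizer \
    --train_file ./my_data.txt \
    --do_train \
    --per_device_train_batch_size 32 \
    --num_train_epochs 1 \
    --warmup_ratio 0.1 \
    --save_steps 5000 \
    --save_total_limit 1 \
    --output_dir test-mlm-wiki \
    --overwrite_output_dir
```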
These are the training arguments on 4.5.1:
output_dir=test-mlm-wiki, overwrite_output_dir=True, do_train=True, do_eval=None, do_predict=False, evaluation_strategy=IntervalStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=32, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_ratio=0.1, warmup_steps=0, logging_dir=runs/Aug08_11-51-18, logging_strategy=IntervalStrategy.STEPS, logging_first_step=False, logging_steps=500, save_strategy=IntervalStrategy.STEPS, save_steps=5000, save_total_limit=1, no_cuda=False, seed=42, fp16=False, fp16_opt_level=O1, fp16_backend=auto, fp16_full_eval=False, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name=test-mlm-wiki, disable_tqdm=True, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=[], deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, length_column_name=length, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, skip_memory_metrics=False, _n_gpu=1, mp_parameters=
These are the training arguments on 4.9.2:
TrainingArguments(_n_gpu=1, adafactor=False, adam_beta1=0.9,
adam_beta2=0.999, adam_epsilon=1e-08, dataloader_drop_last=False, dataloader_num_workers=0, dataloader_pin_memory=True, ddp_find_unused_parameters=None, debug=[], deepspeed=None, disable_tqdm=True, do_eval=False, do_predict=False, do_train=True, eval_accumulation_steps=None, eval_steps=None, evaluation_strategy=IntervalStrategy.NO, fp16=False, fp16_backend=auto, fp16_full_eval=False, fp16_opt_level=O1, gradient_accumulation_steps=1, greater_is_better=None, group_by_length=False, ignore_data_skip=False, label_names=None, label_smoothing_factor=0.0, learning_rate=5e-05, length_column_name=length, load_best_model_at_end=False, local_rank=-1, log_level=-1, log_level_replica=-1, log_on_each_node=True, logging_dir=test-mlm-wiki/runs/Aug27_17-12-33, logging_first_step=False, logging_steps=500, logging_strategy=IntervalStrategy.STEPS, lr_scheduler_type=SchedulerType.LINEAR, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=, no_cuda=False, num_train_epochs=1.0, output_dir=test-mlm-wiki, overwrite_output_dir=True, past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=32, prediction_loss_only=False, push_to_hub=False, push_to_hub_model_id=test-mlm-wiki-sst5-100, push_to_hub_organization=None, push_to_hub_token=None, remove_unused_columns=True, report_to=['tensorboard'], resume_from_checkpoint=None, run_name=test-mlm-wik, save_on_each_node=False, save_steps=5000, save_strategy=IntervalStrategy.STEPS,
save_total_limit=1, seed=42, sharded_ddp=[], skip_memory_metrics=True, tpu_metrics_debug=False, tpu_num_cores=None, use_legacy_prediction_loop=False, warmup_ratio=0.1, warmup_steps=0, weight_decay=0.0)
These are the differences I've found:
| 4.5.1 | 4.9.2 |
|---|---|
| debug=False | debug=[] |
| eval_steps=500 | eval_steps=None |
| logging_dir=runs/Aug08_11-51-18_dogfish-01 | logging_dir=test-mlm-wiki/runs/Aug27_17-12-33_dogfish-01 |
| - | log_level=-1 |
| - | log_level_replica=-1 |
| - | log_on_each_node=True |
| - | push_to_hub=False |
| - | push_to_hub_model_id=test-mlm-wiki-sst5-100 |
| - | push_to_hub_organization=None |
| - | push_to_hub_token=None |
| - | save_on_each_node=False |
| - | use_legacy_prediction_loop=False |
| - | resume_from_checkpoint=None |
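(For anyone who wants to reproduce the comparison, a minimal sketch, assuming the two dumps are saved as args_4.5.1.txt and args_4.9.2.txt; those file names are just made up for the example.)

```bash
# Split each dump into one key=value per line, sort, and diff side by side.
# None of the values in these dumps contain commas, so tr on ',' is enough.
tr ',' '\n' < args_4.5.1.txt | sed 's/^ *//' | sort > args_old.txt
tr ',' '\n' < args_4.9.2.txt | sed 's/^ *//' | sort > args_new.txt
diff --side-by-side --suppress-common-lines args_old.txt args_new.txt
```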
Any ideas why this happens?