I am having difficulty getting my code to log metrics to wandb periodically, so that I can check that I am checkpointing correctly. Specifically, although I am running my model for 10 epochs (with 2 examples per epoch for debugging) and am requesting logging every 2 steps, my wandb output displays only a single point for both train and eval: the very last metric, which correctly corresponds to the output for epoch 10.
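For reference, here is the back-of-the-envelope count of logged points I was expecting (this assumes the default per_device_train_batch_size of 8, which I do not override, so the 2 debug examples fit into a single optimizer step per epoch):

# Rough sketch of what I expected to see in wandb, not output from the script.
# Assumption: per_device_train_batch_size is left at the transformers default of 8.
max_train_samples = 2
per_device_train_batch_size = 8
num_train_epochs = 10
logging_steps = 2  # my interval_steps

steps_per_epoch = max(1, -(-max_train_samples // per_device_train_batch_size))  # ceiling division
total_steps = steps_per_epoch * num_train_epochs
expected_points = total_steps // logging_steps
print(total_steps, expected_points)  # 10 global steps -> roughly 5 logged points, not 1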
Could you please help me find the issue in my code/understanding?
I am adapting the following script to get it to save validation checkpoints periodically:
Specifically, I invoke the script as follows:
python3 run_mlm.py --model_name_or_path bert-base-uncased --do_train --do_eval --output_dir ./models/development/Alex/with_tags --train_file ./finetune/child/Alex/train.txt --validation_file ./finetune/child/Alex/val.txt --max_train_samples 2 --max_eval_samples 2 --overwrite_output_dir
and then, after the arguments are parsed, I override the default values in the TrainingArguments as follows in my version of run_mlm.py:
# Added these lines
training_args.load_best_model_at_end = True
training_args.metric_for_best_model = "eval_loss"
# end added

# 8/7/21 added
is_child = model_args.model_name_or_path != 'bert-base-uncased'
num_epochs = 10 if is_child else 10  # Debug mode only!!!
# end add

# 8/1/21 added line
training_args.save_total_limit = 1

strategy = "steps"
training_args.logging_strategy = strategy
training_args.evaluation_strategy = strategy
training_args.save_strategy = strategy

# For the child scripts
logger.info('run_mlm.py is in debug mode and is requesting epoch = 20 for non-child! Need to revert!')

training_args.save_steps = interval_steps
training_args.logging_steps = interval_steps
training_args.eval_steps = interval_steps
# end added

# For now train for fewer epochs because perplexity difference is not very large.
training_args.num_train_epochs = num_epochs
training_args.learning_rate = learning_rate
# end additions
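For readability, here is roughly what I believe those overrides amount to if written as a single TrainingArguments construction (interval_steps, num_epochs, and learning_rate are set earlier in my script, and the paths come from the command above; the field names are the standard transformers ones):

from transformers import TrainingArguments

# My understanding of the intended configuration, gathered into one place.
training_args = TrainingArguments(
    output_dir="./models/development/Alex/with_tags",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    logging_strategy="steps",
    evaluation_strategy="steps",
    save_strategy="steps",
    logging_steps=interval_steps,
    eval_steps=interval_steps,
    save_steps=interval_steps,
    save_total_limit=1,
    num_train_epochs=num_epochs,
    learning_rate=learning_rate,
)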