Hello,
I am having difficulty getting my code to log metrics to wandb periodically, so that I can check that I am checkpointing correctly. Specifically, although I am running my model for 10 epochs (with 2 examples per epoch, for debugging) and am requesting logging every 2 steps, my wandb output displays only the very last metric for both train and eval: a single dot each. That value does correspond correctly to the epoch-10 output.
Could you please help me find the issue in my code/understanding?
I am adapting the Transformers run_mlm.py example script so that it saves validation checkpoints periodically. Specifically, I run my version of the script with:
python3 run_mlm.py --model_name_or_path bert-base-uncased --do_train --do_eval --output_dir ./models/development/Alex/with_tags --train_file ./finetune/child/Alex/train.txt --validation_file ./finetune/child/Alex/val.txt --max_train_samples 2 --max_eval_samples 2 --overwrite_output_dir
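For context, run_mlm.py turns these flags into argument objects with HfArgumentParser; the stock script does roughly this (the branch for reading arguments from a json file is omitted here):

parser = HfArgumentParser((ModelArguments, DataTrainingArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()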
After the arguments are parsed, I override the default TrainingArguments values in my copy of run_mlm.py as follows:
# Added these lines
training_args.load_best_model_at_end = True
training_args.metric_for_best_model = "eval_loss"
# end added
# 8/7/21 added
is_child = model_args.model_name_or_path != 'bert-base-uncased'
num_epochs = 10 if is_child else 10 # Debug mode only!!! Both branches are deliberately 10 while debugging; the non-child value must be reverted later.
# end add
# 8/1/21 added line
training_args.save_total_limit = 1
strategy = "steps"
training_args.logging_strategy = strategy
training_args.evaluation_strategy = strategy
training_args.save_strategy = strategy
# For the child scripts
logger.info('run_mlm.py is in debug mode and is overriding num_train_epochs for non-child runs! Need to revert!')
interval_steps = 2  # as noted above, I want to log/eval/save every 2 steps
training_args.save_steps = interval_steps
training_args.logging_steps = interval_steps
training_args.eval_steps = interval_steps
# end added
# For now train for fewer epochs because perplexity difference is not very large.
training_args.num_train_epochs = num_epochs
training_args.learning_rate = learning_rate  # learning_rate is defined earlier in my script
# end additions
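To sanity-check that these overrides actually take effect, I can log them right after the block above (a minimal debugging sketch I added; it is not in the stock script):

logger.info(
    f"logging_strategy={training_args.logging_strategy}, "
    f"evaluation_strategy={training_args.evaluation_strategy}, "
    f"save_strategy={training_args.save_strategy}, "
    f"logging_steps={training_args.logging_steps}, "
    f"eval_steps={training_args.eval_steps}, "
    f"save_steps={training_args.save_steps}"
)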