When using the `run_mlm.py` script together with XLA, I'm finding that, despite redirecting output to log files, I still don't get step-by-step output of the metrics.
```bash
%%bash
python xla_spawn.py --num_cores=8 ./run_mlm.py --output_dir="./results" \
--model_type="big_bird" \
--config_name="./config" \
--tokenizer_name="./tokenizer" \
--train_file="./dataset.txt" \
--validation_file="./val.txt" \
--line_by_line="True" \
--max_seq_length="16000" \
--weight_decay="0.01" \
--per_device_train_batch_size="1" \
--per_device_eval_batch_size="1" \
--learning_rate="3e-4" \
--tpu_num_cores="8" \
--warmup_steps="1000" \
--overwrite_output_dir \
--pad_to_max_length \
--num_train_epochs="1" \
--adam_beta1="0.9" \
--adam_beta2="0.98" \
--do_train \
--do_eval \
--logging_steps="10" \
--evaluation_strategy="steps" \
--eval_accumulation_steps="10" \
--report_to="tensorboard" \
--logging_dir="./logs" \
--save_strategy="epoch" \
--load_best_model_at_end="True" \
--metric_for_best_model="accuracy" \
--skip_memory_metrics="False" \
--gradient_accumulation_steps="500" \
--use_fast_tokenizer="True" \
--log_level="info" \
--logging_first_step="True" \
1> >(tee -a stdout.log) \
2> >(tee -a stderr.log >&2)
```
As you can see, I am teeing stdout and stderr to files, but nothing gets logged per step - only the end-of-epoch metrics once training finishes. TensorBoard doesn't help either, since the loss isn't being logged at all, which is quite weird.
I have adjusted `logging_steps`, but that doesn't seem to help. I am quite confused - `Trainer` is supposed to output the loss to the cell output too, but that doesn't happen either.
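One thing I did notice while staring at the arguments (my own back-of-envelope arithmetic, assuming `logging_steps` counts optimizer steps, i.e. steps *after* gradient accumulation): with my settings, a single logged step covers a very large number of examples.

```python
# Back-of-envelope check, assuming logging_steps counts optimizer steps
# (i.e. steps after gradient accumulation) - my reading, not confirmed.
per_device_batch = 1      # --per_device_train_batch_size
num_cores = 8             # --num_cores / --tpu_num_cores
grad_accum = 500          # --gradient_accumulation_steps
logging_steps = 10        # --logging_steps

examples_per_step = per_device_batch * num_cores * grad_accum
print(examples_per_step)                  # 4000 examples per optimizer step
print(examples_per_step * logging_steps)  # 40000 examples between log lines
```

So if the dataset is small, the run could finish a whole epoch before ever reaching a logging step - but even then I'd expect `logging_first_step` to produce at least one line.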
Does anyone know how I can log the metrics every "n" steps?
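For context, this is the kind of logging setup I mean - a minimal, standard-library-only sketch that mirrors INFO-level records to both the console and a file. I'm assuming `Trainer`'s messages go through Python's `logging` module; the file name `trainer.log` is just an example, and I can't say whether this alone fixes the per-step output.

```python
import logging
import sys

# Send every INFO-level record to both the console and a file,
# so per-step metrics (if emitted at all) show up in either place.
# force=True (Python 3.8+) replaces handlers a previous cell installed.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    handlers=[
        logging.StreamHandler(sys.stdout),
        logging.FileHandler("trainer.log"),
    ],
    force=True,
)
logging.getLogger("transformers").info("logging configured")
```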