No loss being logged, when running MLM script (Colab)

Neel-Gupta · July 7, 2021, 11:52am

When running the run_mlm.py script with XLA, I don’t get a step-by-step output of the metrics, even though I am logging the output to files.

%%bash
python xla_spawn.py --num_cores=8 ./run_mlm.py --output_dir="./results" \
    --model_type="big_bird" \
    --config_name="./config" \
    --tokenizer_name="./tokenizer" \
    --train_file="./dataset.txt" \
    --validation_file="./val.txt" \
    --line_by_line="True" \
    --max_seq_length="16000" \
    --weight_decay="0.01" \
    --per_device_train_batch_size="1" \
    --per_device_eval_batch_size="1" \
    --learning_rate="3e-4" \
    --tpu_num_cores='8' \
    --warmup_steps="1000" \
    --overwrite_output_dir \
    --pad_to_max_length \
    --num_train_epochs="1" \
    --adam_beta1="0.9" \
    --adam_beta2="0.98" \
    --do_train \
    --do_eval \
    --logging_steps="10" \
    --evaluation_strategy="steps" \
    --eval_accumulation_steps='10' \
    --report_to="tensorboard" \
    --logging_dir='./logs' \
    --save_strategy="epoch" \
    --load_best_model_at_end='True' \
    --metric_for_best_model='accuracy' \
    --skip_memory_metrics='False'  \
    --gradient_accumulation_steps='500' \
    --use_fast_tokenizer='True' \
    --log_level='info' \
    --logging_first_step='True' \
    1> >(tee -a stdout.log) \
    2> >(tee -a stderr.log >&2)

As you can see, I am writing stderr and stdout to files, but no individual steps are logged, only the end-of-epoch metrics once training has finished. TensorBoard doesn’t help either, since the loss isn’t being logged in the first place, which is quite weird.

I have adjusted logging_steps, but that doesn’t seem to help. I am quite confused: Trainer is supposed to output the loss to the cell output too, but that doesn’t happen either.

Does anyone know how I can log the metrics every ‘n’ steps?

Neel-Gupta · July 7, 2021, 12:14pm

Basically, although I provide the logging_steps argument, it apparently doesn’t override the default, which I presume is per-epoch logging. The same goes for the evaluation strategy, which also runs per epoch instead of every given number of steps.

This is what the script receives on its side:

adafactor=False,
adam_beta1=0.9,
adam_beta2=0.98,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=10,
eval_steps=10,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=500,
greater_is_better=True,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.0003,
length_column_name=length,
load_best_model_at_end=True,
local_rank=-1,
log_level=20,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=./logs,
logging_first_step=True,
logging_steps=10,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=accuracy,
mp_parameters=,
no_cuda=False,
num_train_epochs=1.0,
output_dir=./results,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=results,
push_to_hub_organization=None,
push_to_hub_token=None,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=./results,
save_on_each_node=False,
save_steps=500,
save_strategy=IntervalStrategy.EPOCH,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=False,
tpu_metrics_debug=False,
tpu_num_cores=8,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=1000,
weight_decay=0.01,
)

These values match the flags I provided; they are just not being acted upon.
I will dig further into the script to see what the issue might be.
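
One thing worth checking here, offered as a possible factor rather than a confirmed diagnosis: Trainer counts logging_steps in optimizer update steps, not in batches, so a large gradient_accumulation_steps stretches the interval between logs considerably. The sketch below only does that arithmetic. The helper names and the dataset size (NUM_EXAMPLES) are hypothetical, since the thread does not state how many lines dataset.txt has, and the step count is an approximation of what Trainer computes.

```python
import math


def update_steps_per_epoch(num_examples, per_device_batch_size, num_devices, grad_accum_steps):
    """Approximate number of optimizer updates in one epoch (hypothetical helper)."""
    # Each device sees roughly 1/num_devices of the data, in batches of per_device_batch_size.
    batches_per_device = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # One optimizer update happens every grad_accum_steps batches; assume at least one per epoch.
    return max(batches_per_device // grad_accum_steps, 1)


def steps_that_log(total_update_steps, logging_steps, logging_first_step):
    """Update steps at which step-based logging would emit a training loss."""
    steps = [s for s in range(1, total_update_steps + 1) if s % logging_steps == 0]
    if logging_first_step and total_update_steps >= 1 and 1 not in steps:
        steps.insert(0, 1)
    return steps


# Flag values are taken from the command above; NUM_EXAMPLES is an assumed figure.
NUM_EXAMPLES = 20_000
updates = update_steps_per_epoch(
    NUM_EXAMPLES, per_device_batch_size=1, num_devices=8, grad_accum_steps=500
)
# batch size * cores * accumulation steps * logging_steps
examples_per_log = 1 * 8 * 500 * 10
print(updates, steps_that_log(updates, logging_steps=10, logging_first_step=True), examples_per_log)
```

Under these assumed numbers an epoch contains only 5 update steps, so logging_steps=10 is never reached and only the first step would be logged. This would not by itself explain a missing first-step log, so treat it as one thing to rule out.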

Neel-Gupta · July 8, 2021, 5:44pm

I have tried this with WandB too, and the problem persists. Suffice it to say, Trainer is simply not logging those metrics at all.

RylanSchaeffer · October 4, 2021, 5:54pm

Did you ever find a cause or solution?

nielsr · October 5, 2021, 7:20am

cc @sgugger

sgugger · October 5, 2021, 12:01pm

There is no problem as far as I can see, either on TPU or Colab. If there is a more recent reproducible example, I’m happy to look at it.

RylanSchaeffer · October 5, 2021, 4:01pm

When I add the following arguments to the TrainingArguments, the validation loss appears on wandb.

        do_eval=True,
        evaluation_strategy="steps",
        eval_steps=1,

This suggests to me that the default behavior of trainer.train() does not perform validation at all, even at the end. Is this correct?

sgugger · October 5, 2021, 6:34pm

The default is evaluation_strategy="no", yes, as is clearly documented in the TrainingArguments. Evaluation in NLP often takes a long time (as in the question-answering, translation, and summarization scripts, for instance), which is why there is that default.

If you want to evaluate after training, just run trainer.evaluate()
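
A minimal sketch of the two options discussed above, assuming the Transformers version used in this thread, where the argument is named evaluation_strategy and accepts "no", "steps", or "epoch" (newer releases may name it differently). The helper eval_schedule_kwargs is hypothetical and only builds keyword arguments; model and dataset objects are left out.

```python
def eval_schedule_kwargs(every_n_steps=None):
    """Build evaluation-related keyword arguments for TrainingArguments (hypothetical helper).

    With every_n_steps=None, evaluation runs at the end of each epoch;
    otherwise it runs every `every_n_steps` optimizer update steps.
    """
    if every_n_steps is None:
        return {"evaluation_strategy": "epoch"}
    if every_n_steps < 1:
        raise ValueError("every_n_steps must be a positive integer")
    return {"evaluation_strategy": "steps", "eval_steps": every_n_steps}


# Intended usage (commented out because it needs transformers, a model, and datasets):
# from transformers import Trainer, TrainingArguments
# args = TrainingArguments(output_dir="./results", **eval_schedule_kwargs(500))
# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()
# metrics = trainer.evaluate()  # one-off evaluation after training, as suggested above
print(eval_schedule_kwargs(500))
```

Leaving evaluation_strategy at its default of "no" and calling trainer.evaluate() once after training is the cheapest option; the step-based or epoch-based schedule trades training time for visibility into overfitting.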

RylanSchaeffer · October 5, 2021, 7:19pm

Three comments:

1. The number of arguments in TrainingArguments is overwhelming. There’s no way I would immediately see that.
2. I don’t think that’s customary at all. Validation is almost always run in tandem with training (not necessarily step by step, but every epoch or every few epochs) to check for overfitting.
3. If I pass validation data to an object, I expect it to do something with that data. A default of doing nothing is counter-intuitive.

sgugger · October 6, 2021, 12:18pm

Being rude like this on the forums is counterproductive and against our code of conduct. May I remind you that all of this is free: the software, the course, the documentation and the help.

You could also have checked the course on the tools you are using which introduces the arguments to use.

RylanSchaeffer · October 13, 2021, 10:25pm

I wasn’t the one who started this. You wrote “as is clearly documented,” implying I missed something obvious, so I had to clarify that what you subjectively think is obvious was not obvious to others.

Directing me to the course isn’t helpful. I’m already drowning in too much documentation.

Neel-Gupta · October 14, 2021, 3:12pm

That’s how HuggingFace has been; not to mention a tendency of the authors to be hostile towards people with feedback on how they can make things accessible.

The problem is that HF markets itself very well, which leads many publications to publish models here rather than as clean standalone modules. A pity, but a monopoly over NLP models here does at least provide a platform for easy access.

There’s also a lack of testing of their own code, which often misses obvious bugs and includes snippets that won’t even let modules load; again, they can advertise that they have the latest models while not really providing them in working quality, calling it “WIP”.

It’s a startup in the end, so scummy tactics and angry employees aren’t really unexpected. What I would advise is to be thankful for what you already get.
