No loss being logged when running MLM script (Colab)

When running the run_mlm.py script with XLA, I don’t get step-by-step output of the metrics, even though I am logging to files.

%%bash
python xla_spawn.py --num_cores=8 ./run_mlm.py --output_dir="./results" \
    --model_type="big_bird" \
    --config_name="./config" \
    --tokenizer_name="./tokenizer" \
    --train_file="./dataset.txt" \
    --validation_file="./val.txt" \
    --line_by_line="True" \
    --max_seq_length="16000" \
    --weight_decay="0.01" \
    --per_device_train_batch_size="1" \
    --per_device_eval_batch_size="1" \
    --learning_rate="3e-4" \
    --tpu_num_cores='8' \
    --warmup_steps="1000" \
    --overwrite_output_dir \
    --pad_to_max_length \
    --num_train_epochs="1" \
    --adam_beta1="0.9" \
    --adam_beta2="0.98" \
    --do_train \
    --do_eval \
    --logging_steps="10" \
    --evaluation_strategy="steps" \
    --eval_accumulation_steps='10' \
    --report_to="tensorboard" \
    --logging_dir='./logs' \
    --save_strategy="epoch" \
    --load_best_model_at_end='True' \
    --metric_for_best_model='accuracy' \
    --skip_memory_metrics='False'  \
    --gradient_accumulation_steps='500' \
    --use_fast_tokenizer='True' \
    --log_level='info' \
    --logging_first_step='True' \
    1> >(tee -a stdout.log) \
    2> >(tee -a stderr.log >&2)

As you can see, I am redirecting stderr and stdout to files, but no individual steps are logged, only the end-of-epoch metrics once training is finished. TensorBoard doesn’t help either, since the loss isn’t being logged in the first place :thinking: which is quite weird.

I have adjusted logging_steps, but that doesn’t seem to help. I am quite confused: Trainer is supposed to output the loss to the cell output too, but that doesn’t happen either.

Does anyone know how I can log the metrics every ‘n’ steps?

Basically, despite my providing the logging_steps argument, it apparently doesn’t override the default, which I presume is set to epoch. The same goes for the evaluation strategy, which also runs per epoch instead of at the number of steps provided.

This is what the script receives on its side:

adafactor=False,
adam_beta1=0.9,
adam_beta2=0.98,
adam_epsilon=1e-08,
dataloader_drop_last=False,
dataloader_num_workers=0,
dataloader_pin_memory=True,
ddp_find_unused_parameters=None,
debug=[],
deepspeed=None,
disable_tqdm=False,
do_eval=True,
do_predict=False,
do_train=True,
eval_accumulation_steps=10,
eval_steps=10,
evaluation_strategy=IntervalStrategy.STEPS,
fp16=False,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
gradient_accumulation_steps=500,
greater_is_better=True,
group_by_length=False,
ignore_data_skip=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=0.0003,
length_column_name=length,
load_best_model_at_end=True,
local_rank=-1,
log_level=20,
log_level_replica=-1,
log_on_each_node=True,
logging_dir=./logs,
logging_first_step=True,
logging_steps=10,
logging_strategy=IntervalStrategy.STEPS,
lr_scheduler_type=SchedulerType.LINEAR,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=accuracy,
mp_parameters=,
no_cuda=False,
num_train_epochs=1.0,
output_dir=./results,
overwrite_output_dir=True,
past_index=-1,
per_device_eval_batch_size=1,
per_device_train_batch_size=1,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=results,
push_to_hub_organization=None,
push_to_hub_token=None,
remove_unused_columns=True,
report_to=['tensorboard'],
resume_from_checkpoint=None,
run_name=./results,
save_on_each_node=False,
save_steps=500,
save_strategy=IntervalStrategy.EPOCH,
save_total_limit=None,
seed=42,
sharded_ddp=[],
skip_memory_metrics=False,
tpu_metrics_debug=False,
tpu_num_cores=8,
use_legacy_prediction_loop=False,
warmup_ratio=0.0,
warmup_steps=1000,
weight_decay=0.01,
)

These match the flags I provided; they are just not being acted upon.
I will dig further into the script to see what the issue might be.
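
One thing that may be worth checking, sketched here under assumptions: the TrainingArguments documentation describes logging_steps as a number of update steps, so with gradient_accumulation_steps=500 each logged step would need 500 forward passes per core. The snippet below is plain arithmetic, not Trainer code. The dataset size of 40,000 lines is hypothetical, and the rounding rule (floor, with a minimum of one update per epoch) is an assumption about how updates are counted.

```python
import math


def optimizer_steps_per_epoch(num_examples, per_device_batch_size, num_cores, grad_accum_steps):
    """Estimate optimizer updates in one epoch (assumed rounding, see text above)."""
    # Batches each core processes per epoch when the data is split across cores.
    batches_per_core = math.ceil(num_examples / (per_device_batch_size * num_cores))
    # One optimizer update per grad_accum_steps batches, at least one per epoch.
    return max(batches_per_core // grad_accum_steps, 1)


def logged_steps(total_steps, logging_steps, logging_first_step=False):
    """Return the update-step numbers at which a log line would be expected."""
    steps = [s for s in range(1, total_steps + 1) if s % logging_steps == 0]
    if logging_first_step and total_steps >= 1 and 1 not in steps:
        steps.insert(0, 1)
    return steps


NUM_EXAMPLES = 40_000  # hypothetical line count for dataset.txt

with_accum = optimizer_steps_per_epoch(NUM_EXAMPLES, 1, 8, 500)
without_accum = optimizer_steps_per_epoch(NUM_EXAMPLES, 1, 8, 1)

print(with_accum, logged_steps(with_accum, 10, logging_first_step=True))
print(without_accum, len(logged_steps(without_accum, 10, logging_first_step=True)))
```

Under these assumptions, accumulating over 500 steps leaves only 10 updates in the whole epoch, so at most two log lines (updates 1 and 10) would be expected, compared with 501 without accumulation.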

I have tried this with WandB too, and the problem persists. Suffice it to say, Trainer is simply not logging those metrics at all.

Did you ever find a cause or solution?

cc @sgugger

There is no problem as far as I can see, either on TPU or Colab. If there is a more recent reproducible example, I’m happy to look at it.

When I add the following arguments to the TrainingArguments, the validation loss appears on wandb.

        do_eval=True,
        evaluation_strategy="steps",
        eval_steps=1,

This suggests to me that the default behavior of trainer.train() does not perform validation at all, even at the end. Is this correct?

The default is evaluation_strategy="no", yes, as is clearly documented in the TrainingArguments. Evaluation in NLP often takes a long time (as in the question-answering, translation and summarization scripts, for instance), which is why there is that default.

If you want to evaluate after training, just run trainer.evaluate()
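
To make the default concrete, here is a minimal sketch of how an interval strategy could gate evaluation during training. The enum mirrors the IntervalStrategy values visible in the argument dump earlier in the thread ("steps", "epoch") plus the "no" default. The should_evaluate helper is hypothetical and is not the Trainer's actual implementation.

```python
from enum import Enum


class IntervalStrategy(Enum):
    """Stand-in for the values accepted by evaluation_strategy."""
    NO = "no"
    STEPS = "steps"
    EPOCH = "epoch"


def should_evaluate(strategy, global_step, eval_steps, end_of_epoch):
    """Hypothetical helper: decide whether to run evaluation at this point."""
    strategy = IntervalStrategy(strategy)  # raises ValueError for unknown strings
    if strategy is IntervalStrategy.STEPS:
        return global_step % eval_steps == 0
    if strategy is IntervalStrategy.EPOCH:
        return end_of_epoch
    return False  # "no": never evaluate during training


print(should_evaluate("no", 10, 10, end_of_epoch=True))
print(should_evaluate("steps", 10, 10, end_of_epoch=False))
print(should_evaluate("epoch", 10, 10, end_of_epoch=True))
```

In this sketch, "no" never triggers evaluation during training, which matches the need to call trainer.evaluate() explicitly afterwards, and any string outside the three listed values raises a ValueError.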

Three comments:

  1. The number of arguments in TrainingArguments is overwhelming. There’s no way I would immediately see that.
  2. I don’t think that’s customary at all. Validation is almost always run in tandem with training (not necessarily step by step, but every epoch or every few epochs) to check for overfitting.
  3. If I pass validation data to an object, I expect it to do something with that data. A default of doing nothing is counter-intuitive.

Being rude like this on the forums is counterproductive and against our code of conduct. May I remind you that all of this is for free: the software, the course, the documentation and the help.

You could also have checked the course on the tools you are using which introduces the arguments to use.

I wasn’t the one who started this. You wrote “as is clearly documented,” implying I missed something obvious, so I had to clarify that what you subjectively think is obvious was not obvious to others.

Directing me to the course isn’t helpful. I’m already drowning in too much documentation.

That’s how HuggingFace has been, not to mention a tendency of the authors to be hostile towards people with feedback on how they can make things accessible.

The problem is that HF markets itself very well, which leads many publications to publish models here rather than using standalone clean modules. A pity, but a monopoly over NLP models here does at least provide a platform for easy access.

There’s also a lack of testing of their own code, often missing obvious bugs and snippets that won’t even let modules load; again, they can market that they have the latest models while not really providing them in working quality, calling it “WIP”.

It’s a startup in the end, so scummy tactics and angry employees aren’t really below expectations. What I would advise is to be thankful for what you get already :pray:
