Logging & Experiment tracking with W&B

I believe in bash it would be `export WANDB_WATCH=all`.
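The same setting can also be applied from Python before the Trainer is created; a minimal sketch using only the standard library:

```python
import os

# Equivalent of `export WANDB_WATCH=all`: must be set before the
# Trainer (and its W&B callback) is constructed, or it has no effect.
os.environ["WANDB_WATCH"] = "all"
```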


Hello everyone,

Although I really like the W&B integration, I found it creates some issues if one uses multiple trainers. It seems that a W&B experiment session is tied to the lifecycle of a Python process rather than to a Trainer. The current integration calls wandb.init just once and doesn’t deregister the watch on the model at the end of the training session.

Thoughts @boris ?

Hi @vblagoje,

That’s correct: the integration with HuggingFace assumes that a typical run is linked to one process. However, the intent is to allow full flexibility for custom cases through environment variables.

You can for example resume a previous run from a different process (see resuming docs).

Would you be able to make a simple colab/notebook so I can better understand your use case?

Hey @boris,

Thanks for the prompt response. My use case is here. In a nutshell, I want (need) to use two Trainer instances sequentially in a single process. The training actually works ok if I completely turn off W&B, but I don’t want to turn it off, as it gives me excellent insights into the training progress.

I first thought about patching the trainer callbacks to call unwatch and to call init on each on_train_begin Trainer callback, but then I decided to ask here first for advice.

You could try calling wandb.finish() between your runs.

Also you can always set up your runs manually with wandb.init(**optional_args), in which case the wandb.init call from the Trainer will be ignored.
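A minimal sketch of that suggestion, assuming the `wandb` package is available; the project name and the trainer objects are placeholders:

```python
def train_sequentially(trainers, project="my-project"):
    """Run each trainer under its own W&B run, calling wandb.finish()
    between runs so the next wandb.init() starts cleanly."""
    import wandb  # imported lazily so the helper is importable without wandb

    for trainer in trainers:
        # Explicit init: the Trainer's own wandb.init call is then ignored.
        wandb.init(project=project)
        trainer.train()
        wandb.finish()  # close this run before the next one starts
```

With this pattern each training loop shows up as a separate run in the W&B dashboard.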


Ok that’s awesome, I’ll figure out something. Thanks @boris, have a great weekend!


Hey @boris ,

I played with the API and got everything I wanted out of it, except calling unwatch. Even though I call wandb.finish between the training runs, the model is still being watched, and the subsequent training run in the same process fails. There are workarounds, but still, I wonder why not just call unwatch? It somehow seems hidden from the main wandb API.


Hi @vblagoje,

That’s a very good point. I typically have separate runs during training, so I never encountered this issue.
Does it work as intended if you use `wandb.unwatch()`?
We can definitely add it to the integration.

Also can you confirm your intended use:

  • you want to train without watching?
  • you want to have 2 training loops where you don’t watch the second loop?
  • you don’t want to watch at all (in this case there is an environment variable)

Feel free to share any script or notebook that illustrates your issue.


I don’t know how to import unwatch; it does not seem to be exposed at the same level as the other methods. Watching the model is not essential, but if I don’t turn it off with env vars, the training breaks:

Traceback (most recent call last):
  File "run_bert_pretraining.py", line 362, in <module>
  File "run_bert_pretraining.py", line 349, in main
    trainer.train(model_path=training_args.output_dir if trainer_state else None)
  File "/home/paperspace/transformers/src/transformers/trainer.py", line 688, in train
    self.control = self.callback_handler.on_train_begin(self.args, self.state, self.control)
  File "/home/paperspace/transformers/src/transformers/trainer_callback.py", line 329, in on_train_begin
    return self.call_event("on_train_begin", args, state, control)
  File "/home/paperspace/transformers/src/transformers/trainer_callback.py", line 376, in call_event
  File "/home/paperspace/transformers/src/transformers/integrations.py", line 283, in on_train_begin
    self.setup(args, state, model)
  File "/home/paperspace/transformers/src/transformers/integrations.py", line 279, in setup
    wandb.watch(model, log=os.getenv("WANDB_WATCH", "gradients"), log_freq=max(100, args.logging_steps))
  File "/home/paperspace/miniconda3/envs/hf-36/lib/python3.6/site-packages/wandb/sdk/wandb_watch.py", line 75, in watch
    model, criterion, graph_idx=global_idx
  File "/home/paperspace/miniconda3/envs/hf-36/lib/python3.6/site-packages/wandb/wandb_torch.py", line 295, in hook_torch
    graph.hook_torch_modules(model, criterion, graph_idx=graph_idx)
  File "/home/paperspace/miniconda3/envs/hf-36/lib/python3.6/site-packages/wandb/wandb_torch.py", line 338, in hook_torch_modules
    "You can only call `wandb.watch` once per model.  Pass a new instance of the model if you need to call wandb.watch again in your code.")
ValueError: You can only call `wandb.watch` once per model.  Pass a new instance of the model if you need to call wandb.watch again in your code.

So yes, I want to have two training loops and pass the model between them. Once the first training finishes, the second continues. That’s when I get the error above ^.

You can easily replicate the issue by having two trainer instances in a single process both calling train sequentially. LMK if this clarification helps.

Thank you,

In that case you could just use the environment variable so that the integration does not automatically watch the model.

Then you can watch it manually (do it only once):

import wandb

# after your first loop
wandb.watch(model)
The model should still be watched during your second training loop, and there should not be any issue.

Feel free to share your script if you still have issues.

I’m now working on saving trained models as artifacts.
Please join this PR if you have any comments.

Hello. I am using wandb for training BART on my dataset. However, the number of steps logged and the number of samples in the training data do not match.

As an example, I have 79 datapoints for training and 9 for validation. I log every step and my batch size is 2. The wandb logger logs 163 steps for the loss. I validate every half epoch, and validation occurs at steps 41, 82, 123, and 164. Moreover, when validating, the loss makes a step of 3 datapoints (e.g. from 40 to 43, as if the batch size were 3). Do you know why this is the case? I would expect 158 steps for the loss and validation at 40, 80, 120, 158.

The script I am using is finetune.py from examples/seq2seq

!python3 finetune.py \
--model_name_or_path facebook/bart-large-cnn \
--tokenizer_name facebook/bart-large-cnn \
--data_dir $data_dir \
--learning_rate 3e-5 --label_smoothing 0.1 --num_train_epochs 2 \
--sortish_sampler --freeze_embeds --adafactor \
--task summarization \
--do_train \
--max_source_length 1024 \
--max_target_length $SUMMARY_MAX_LEN \
--val_max_target_length $SUMMARY_MAX_LEN \
--test_max_target_length $SUMMARY_MAX_LEN \
--train_batch_size 2 --eval_batch_size 2 \
--eval_beams 2 \
--val_check_interval 0.5 \
--log_every_n_steps 1 \
--logger_name wandb \
--output_dir $output_dir \
--overwrite_output_dir \
--gpus 1 \
--seed $SEED

Hi @marcoabrate,

W&B uses the trainer’s global step (`state.global_step` in the HF Trainer) to log each step, so it should match the number of steps calculated by your trainer, depending on the settings.
This script seems to be using WandbLogger from PyTorch Lightning, so in this case it will match the steps from the PL loop (which should match the length of the dataloader).

@boris I ran into an issue when I create a Python subprocess and use it for training. Long story short, I added a transformers-cli command for training and invoke a training subprocess with:
popen = subprocess.Popen(command, stdout=out, stderr=err)

I get the following error at the end of the training.

If I add a return statement at the beginning of the wandb on_train_end in integrations.py, everything works ok.

How can I (we) fix this?
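For context, a self-contained sketch of that launch pattern; the command shown is a placeholder standing in for the real transformers-cli training invocation:

```python
import os
import subprocess
import sys
import tempfile

# Placeholder command; in practice this would be the training CLI call.
command = [sys.executable, "-c", "print('training step')"]

out_path = os.path.join(tempfile.gettempdir(), "train.out")
err_path = os.path.join(tempfile.gettempdir(), "train.err")

# Redirect the subprocess's streams to files, as in the snippet above.
with open(out_path, "w") as out, open(err_path, "w") as err:
    popen = subprocess.Popen(command, stdout=out, stderr=err)
    return_code = popen.wait()  # block until the training subprocess exits
```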


@boris even if I remove the self._wandb.log({}) call that causes the failure, referencing self._log_model on the next line of code also causes the following failure:
AttributeError: 'WandbCallback' object has no attribute '_log_model'

What works is initializing self._log_model to False inside the init method, just as you do with the other instance variables, and (re)moving the self._wandb.log({}) call elsewhere.


FYI, the callback handler also fails with the same error when running SQuAD in distributed mode, and I suspect it fails in similar distributed training setups as well. The workaround is to disable W&B with the WANDB_DISABLED env flag.

Can your previous workaround solve the issue in this case as well?

I solved the previous problem by making a custom callback handler identical to the W&B one, but without the on_train_end implementation and without the WANDB_DISABLED flag check. So I add that handler and turn off the default one with the env flag. We can live with this workaround, but it would be great to look at this and potentially find another solution that covers these edge cases.
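A generic sketch of that workaround (hypothetical name, no transformers import): wrap the stock callback in a proxy that forwards everything except on_train_end, which becomes a no-op:

```python
class SkipTrainEnd:
    """Proxy that forwards every callback method to the wrapped callback,
    except on_train_end, which is replaced by a no-op."""

    def __init__(self, base_callback):
        self._base = base_callback

    def __getattr__(self, name):
        if name == "on_train_end":
            # Skip the teardown that crashes in subprocess/distributed runs.
            return lambda *args, **kwargs: None
        return getattr(self._base, name)
```

In practice the wrapped object would be the stock WandbCallback instance, registered with the Trainer in place of the original.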

Noted, can you file a bug with these issues?
If you have a reproducible example, that’s even better.
I want the callback to work in every situation.

I released a new demo for optimizing models with W&B.

Feel free to ask if you have any questions!
