Logging & Experiment tracking with W&B

I believe in bash it would be `export WANDB_WATCH=all`.
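The same setting can also be applied from Python before the Trainer is created; a minimal sketch using only the standard library:

```python
import os

# Equivalent of `export WANDB_WATCH=all`: must be set before the
# Trainer (and its W&B callback) is constructed, or it has no effect.
os.environ["WANDB_WATCH"] = "all"
```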


Hello everyone,

Although I really like the W&B integration, I found it creates some issues if one uses multiple trainers. It seems that a W&B experiment session is tied to the lifecycle of a Python process rather than to a Trainer. The current integration calls wandb.init just once and doesn’t deregister the watch on the model at the end of the training session.

Thoughts @boris ?

Hi @vblagoje,

That’s correct: the integration with HuggingFace assumes that a typical run is linked to one process. However, the intent is to allow full flexibility for custom cases through environment variables.

You can for example resume a previous run from a different process (see resuming docs).

Would you be able to make a simple colab/notebook so I can better understand your use case?

Hey @boris,

Thanks for the prompt response. My use case is here. In a nutshell, I want (need) to use two Trainer instances sequentially in a single process. The training actually works ok if I completely turn off W&B, but I don’t want to turn it off, as it gives me excellent insights into the training progress.

I first thought about patching the trainer callbacks to call unwatch and to call init on each on_train_begin Trainer callback, but then I decided to ask here first for advice.

You could try calling wandb.finish() between your runs.

Also you can always set up your runs manually with wandb.init(**optional_args), in which case the wandb.init call from the Trainer will be ignored.
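A minimal sketch of that suggestion, assuming the `wandb` package is available; the project name and the trainer objects are placeholders:

```python
def train_sequentially(trainers, project="my-project"):
    """Run each trainer under its own W&B run, calling wandb.finish()
    between runs so the next wandb.init() starts cleanly."""
    import wandb  # imported lazily so the helper is importable without wandb

    for trainer in trainers:
        # Explicit init: the Trainer's own wandb.init call is then ignored.
        wandb.init(project=project)
        trainer.train()
        wandb.finish()  # close this run before the next one starts
```

With this pattern each training loop shows up as a separate run in the W&B dashboard.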


Ok that’s awesome, I’ll figure out something. Thanks @boris, have a great weekend!


Hey @boris ,

I played with the API and got everything I wanted out of it, except calling unwatch. Even though I call wandb.finish between the training runs, the model is still being watched, and the subsequent training run in the same process fails. There are workarounds, but still, I wonder why not just call unwatch? It somehow seems hidden from the main wandb API.


Hi @vblagoje,

That’s a very good point. I typically have separate runs during training, so I never encountered this issue.
Does it work as intended if you use `wandb.unwatch()`?
We can definitely add it to the integration.

Also can you confirm your intended use:

  • you want to train without watching?
  • you want to have 2 training loops where you don’t watch the second loop?
  • you don’t want to watch at all (in this case there is an environment variable)

Feel free to share any script or notebook that illustrates your issue.


I don’t know how to import unwatch; it does not seem to be exposed at the same level as the other methods. Watching the model is not essential, but if I don’t turn it off with env vars, the training breaks:

Traceback (most recent call last):
  File "run_bert_pretraining.py", line 362, in <module>
  File "run_bert_pretraining.py", line 349, in main
    trainer.train(model_path=training_args.output_dir if trainer_state else None)
  File "/home/paperspace/transformers/src/transformers/trainer.py", line 688, in train
    self.control = self.callback_handler.on_train_begin(self.args, self.state, self.control)
  File "/home/paperspace/transformers/src/transformers/trainer_callback.py", line 329, in on_train_begin
    return self.call_event("on_train_begin", args, state, control)
  File "/home/paperspace/transformers/src/transformers/trainer_callback.py", line 376, in call_event
  File "/home/paperspace/transformers/src/transformers/integrations.py", line 283, in on_train_begin
    self.setup(args, state, model)
  File "/home/paperspace/transformers/src/transformers/integrations.py", line 279, in setup
    wandb.watch(model, log=os.getenv("WANDB_WATCH", "gradients"), log_freq=max(100, args.logging_steps))
  File "/home/paperspace/miniconda3/envs/hf-36/lib/python3.6/site-packages/wandb/sdk/wandb_watch.py", line 75, in watch
    model, criterion, graph_idx=global_idx
  File "/home/paperspace/miniconda3/envs/hf-36/lib/python3.6/site-packages/wandb/wandb_torch.py", line 295, in hook_torch
    graph.hook_torch_modules(model, criterion, graph_idx=graph_idx)
  File "/home/paperspace/miniconda3/envs/hf-36/lib/python3.6/site-packages/wandb/wandb_torch.py", line 338, in hook_torch_modules
    "You can only call `wandb.watch` once per model.  Pass a new instance of the model if you need to call wandb.watch again in your code.")
ValueError: You can only call `wandb.watch` once per model.  Pass a new instance of the model if you need to call wandb.watch again in your code.

So yes, I want to have two training loops and pass the model between them. Once the first training finishes, the second continues. That’s when I get the error above ^.

You can easily replicate the issue by having two trainer instances in a single process both calling train sequentially. LMK if this clarification helps.

Thank you,

In that case you could just use the environment variable so that the integration does not automatically watch the model.

Then you can watch it manually (do it only once):

import wandb

# after your first loop
wandb.watch(model)
The model should still be watched during your second training loop, and there should not be any issue.

Feel free to share your script if you still have issues.

I’m now working on saving trained models as artifacts.
Please join this PR if you have any comments.

Hello. I am using wandb for training BART on my dataset. However, the number of steps logged and the number of samples in the training data do not match.

As an example, I have 79 datapoints for training and 9 for validation. I log every step and my batch size is 2. The wandb logger logs 163 steps for the loss. I validate every half epoch, and validation occurs at steps 41, 82, 123, and 164. Moreover, when validating, the loss makes a step of 3 datapoints (e.g. from 40 to 43, as if the batch size were 3). Do you know why this is the case? I would expect 158 steps for the loss and validation at 40, 80, 120, 158.

The script I am using is finetune.py from examples/seq2seq

!python3 finetune.py \
--model_name_or_path facebook/bart-large-cnn \
--tokenizer_name facebook/bart-large-cnn \
--data_dir $data_dir \
--learning_rate 3e-5 --label_smoothing 0.1 --num_train_epochs 2 \
--sortish_sampler --freeze_embeds --adafactor \
--task summarization \
--do_train \
--max_source_length 1024 \
--max_target_length $SUMMARY_MAX_LEN \
--val_max_target_length $SUMMARY_MAX_LEN \
--test_max_target_length $SUMMARY_MAX_LEN \
--train_batch_size 2 --eval_batch_size 2 \
--eval_beams 2 \
--val_check_interval 0.5 \
--log_every_n_steps 1 \
--logger_name wandb \
--output_dir $output_dir \
--overwrite_output_dir \
--gpus 1 \
--seed $SEED

Hi @marcoabrate,

W&B uses the trainer’s global step (`state.global_step` in the HF Trainer) to log each step, so it should match the number of steps calculated by your trainer, depending on the settings.
This script seems to be using WandbLogger from PyTorch Lightning, so in this case it will match the steps from the PL loop (which should match the length of the dataloader).

@boris I ran into an issue when I create a Python subprocess and use it for training. Long story short, I added a transformers-cli command for training and invoke a training subprocess with:
popen = subprocess.Popen(command, stdout=out, stderr=err)

I get the following error at the end of the training.

If I add a return statement at the beginning of the wandb on_train_end in integrations.py, everything works ok.

How can I (we) fix this?
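For context, a self-contained sketch of that launch pattern; the command shown is a placeholder standing in for the real transformers-cli training invocation:

```python
import os
import subprocess
import sys
import tempfile

# Placeholder command; in practice this would be the training CLI call.
command = [sys.executable, "-c", "print('training step')"]

out_path = os.path.join(tempfile.gettempdir(), "train.out")
err_path = os.path.join(tempfile.gettempdir(), "train.err")

# Redirect the subprocess's streams to files, as in the snippet above.
with open(out_path, "w") as out, open(err_path, "w") as err:
    popen = subprocess.Popen(command, stdout=out, stderr=err)
    return_code = popen.wait()  # block until the training subprocess exits
```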


@boris even if I remove the self._wandb.log({}) call that causes the failure, referencing self._log_model on the next line of code also causes the following failure:
AttributeError: 'WandbCallback' object has no attribute '_log_model'

What works is initializing self._log_model to False inside the init method, just as you do with the other instance variables, and (re)moving the self._wandb.log({}) call elsewhere.


FYI, the callback handler also fails with the same error when running SQuAD in distributed mode, and I suspect it fails in similar distributed training setups as well. The workaround is to disable W&B with the WANDB_DISABLED env flag.

Can your previous workaround solve the issue in this case as well?

I solved the previous problem by making a custom callback handler identical to the W&B one, but without the on_train_end implementation and without the WANDB_DISABLED flag check. So I add that handler and turn off the default one with the env flag. We can live with this workaround, but it would be great to look at this and potentially find another solution that covers these edge cases.
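A generic sketch of that workaround (hypothetical name, no transformers import): wrap the stock callback in a proxy that forwards everything except on_train_end, which becomes a no-op:

```python
class SkipTrainEnd:
    """Proxy that forwards every callback method to the wrapped callback,
    except on_train_end, which is replaced by a no-op."""

    def __init__(self, base_callback):
        self._base = base_callback

    def __getattr__(self, name):
        if name == "on_train_end":
            # Skip the teardown that crashes in subprocess/distributed runs.
            return lambda *args, **kwargs: None
        return getattr(self._base, name)
```

In practice the wrapped object would be the stock WandbCallback instance, registered with the Trainer in place of the original.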

Noted, can you file a bug with these issues?
If you have a reproducible example, that’s even better.
I want the callback to work in every situation.

I released a new demo for optimizing models with W&B.

Feel free to ask if you have any questions!
