Noted, can you file a bug with these issues?
If you have a reproducible example it's even better.
I want the callback to work in every situation.
Hi @boris, thanks for sharing such a great tutorial!
One question I have is whether you know how to suppress the run summary output that comes from executing wandb.finish() in a Jupyter notebook. I find this output makes my notebooks very messy, especially when running a hyperparameter search with the HF Trainer.
I know the WandB docs suggest the following to suppress info messages:
import logging
logger = logging.getLogger("wandb")
logger.setLevel(logging.ERROR)
but this doesn't seem to remove the run summaries. Thanks!
Usually you could just use %%capture at the top of your cell, but for some reason it does not work here.
Another solution is to just use:
from IPython.display import clear_output
clear_output()
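For instance, at the end of a training cell it could look like this (a minimal sketch; assumes trainer and wandb are already set up):

trainer.train()
wandb.finish()
clear_output()  # clears the cell output, including the run summary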
FYI I found a more elegant way to suppress the output from the technical FAQ of the W&B docs: Technical FAQ - Documentation
Basically just run
%env WANDB_SILENT true
before loading W&B.
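If you prefer plain Python over the IPython magic, the equivalent (a minimal sketch; the variable must be set before wandb is imported) is:

import os
os.environ['WANDB_SILENT'] = 'true'  # must be set before importing wandb

import wandb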
At the moment, models are saved at the end of training.
Does anyone see a need for being able to save the models as they train?
Only if CPU usage is kept low.
I was also thinking that it is a sync operation; on a slow connection, uploading 1 GB to the cloud can take quite some time.
It would also be nice to have a progress bar. I had a cell with a total running time of 6 hours: the model trained in 2, and the rest was wandb.finish(). When I eventually stopped the cell, wandb detected the ctrl-c and reported that the data upload was interrupted, but I think my data was already on the cloud and finish() just got stuck somewhere.
Upload is async so it should not interfere with your training.
Hey @boris, I am stuck with an issue and hope you might be able to resolve it. I am currently using transformers (in particular the Trainer API) to run sweeps using the CLI. The Trainer saves model checkpoints every save_steps under output_dir. Since my training procedure takes quite some time, I am wondering how I would be able to resume preempted sweeps in the context of the Trainer API, i.e. by loading previous checkpoints from output_dir?
Currently, I thought about something like this; however, I am not sure if it would work:
from pathlib import Path
import wandb

ckpt_dir = Path(f'models/checkpoints_{wandb.config.gradient_accumulation_steps}_{wandb.config.learning_rate}')
if wandb.run.resumed and ckpt_dir.exists() and any(ckpt_dir.iterdir()):  # any saved checkpoints?
    trainer.train(resume_from_checkpoint=True)
else:
    trainer.train()
Hi @simonschoe, are your models auto-logged as artifacts (with WANDB_LOG_MODEL)?
If so, you don't even have to manage your models locally as you can easily redownload the best model from your sweep later.
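The redownload would look something along these lines (a minimal sketch; the artifact name and alias are hypothetical and depend on your project):

import wandb

run = wandb.init()
artifact = run.use_artifact('my-entity/my-project/model-abc123:v0')  # hypothetical artifact name
model_dir = artifact.download()  # returns the local directory containing the files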
No, not yet. I'd first like to make it work locally and eventually move over to storing checkpoints online along the way.
Ok, in case you want to do it, just use:
os.environ['WANDB_LOG_MODEL'] = 'true'
As for resuming from a checkpoint, I believe you need to use trainer.train(path) here.
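With the resume_from_checkpoint argument from your snippet above, that would look something like this (the checkpoint path is hypothetical):

trainer.train(resume_from_checkpoint='models/checkpoints_8_5e-5/checkpoint-500')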
Thanks for your reply! Do you know if the setting os.environ['WANDB_LOG_MODEL'] = 'true' also ensures that model checkpoints are uploaded/logged along the way? For example, I'd like to store model checkpoints every x steps. However, from the artifacts documentation I inferred that artifacts are generally stored after a successful training run, i.e. at the end. Is this correct?
At the moment it's only at the end of training; however, I opened an issue to be able to log models continuously.
Hey guys, it seems that eval loss is not being logged onto wandb (No loss being logged, when running MLM script (Colab) - #3 by Neel-Gupta).
Also, it looks like the code cell for pre-training just keeps running even after the model has been trained. That is, it apparently never gets past wandb.finish() and keeps on going.
Any ideas on how I can solve these issues?
@boris any updates on @Neel-Gupta's point? The validation loss inside Trainer() isn't logged to wandb.
Looks like his post was answered.
What matters is that wandb logs any metric that is produced by the Trainer, so you need to make sure you use the correct arguments for the Trainer (in particular the evaluation strategy and interval).
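For example, something along these lines should get the eval loss logged (a minimal sketch; the values are only illustrative):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir='output',
    evaluation_strategy='steps',  # run evaluation during training, not only at the end
    eval_steps=500,               # evaluate (and log eval metrics) every 500 steps
    logging_steps=100,            # how often training metrics are logged
    report_to='wandb',            # send all Trainer logs to W&B
)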
I have a trainer in which I'm overriding the evaluate method to inject some custom functionality. In particular, I'm computing some metrics "by hand" in the evaluate method, returning them in the appropriate Dict[str, float] object. However, these metrics aren't being logged to W&B. Is this because they're not being computed via the compute_metrics function typically passed to Trainer? I'm invoking W&B in the simplest way possible here, just passing report_to="wandb" in TrainingArguments.
You will need to manually log them to have them reported to wandb (with the log method of the Trainer).
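For instance, something along these lines (a minimal sketch; the metric name and value are hypothetical):

from transformers import Trainer

class MyTrainer(Trainer):
    def evaluate(self, eval_dataset=None, ignore_keys=None, metric_key_prefix='eval'):
        metrics = super().evaluate(eval_dataset, ignore_keys=ignore_keys,
                                   metric_key_prefix=metric_key_prefix)
        custom = {'eval_my_custom_metric': 0.42}  # metric computed "by hand"; placeholder value
        self.log(custom)  # self.log forwards the dict to all callbacks, including the W&B one
        metrics.update(custom)
        return metrics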