Logging & Experiment tracking with W&B

Hi @boris, thanks for sharing such a great tutorial!

One question: do you know how to suppress the run summary output that wandb.finish() prints in a Jupyter notebook?

I find this output makes my notebooks very messy, especially when running a hyperparameter search with the HF Trainer.

I know the W&B docs suggest the following to suppress info messages:

import logging
logger = logging.getLogger("wandb")
logger.setLevel(logging.ERROR)

but this doesn’t seem to remove the run summaries. Thanks!

Usually you could just use %%capture at the top of your cell, but for some reason it does not work here.

Another solution is to just use:

from IPython.display import clear_output
clear_output()

Thanks for the tip @boris!

FYI, I found a more elegant way to suppress the output in the Technical FAQ of the W&B docs.

Basically just run

%env WANDB_SILENT true

before loading W&B
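Outside a notebook, where the %env magic is not available, the same variable can be set from Python. This is a minimal sketch, assuming W&B reads WANDB_SILENT from the process environment as the FAQ describes; the wandb import is left commented out so the snippet runs without the package installed.

```python
import os

# Set the variable before W&B is imported or initialised,
# mirroring the "before loading W&B" advice above.
os.environ["WANDB_SILENT"] = "true"

# import wandb  # import W&B only after the variable is set
```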

At the moment, models are saved at the end of training.
Does anyone see a need for being able to save the models as they train?

Only if CPU usage is kept low.

Also, I was thinking that it is a sync operation: on a slow connection, uploading 1 GB to the cloud can take quite some time.

A progress bar would also be nice. I had a cell with a total running time of 6 hours: the model trained in 2, and the rest was spent in wandb finish. In the end I stopped the cell; wandb detected the “ctrl-c” and reported that wandb --data was interrupted, but I think my data was already in the cloud and finish just got stuck somewhere.

Upload is async so it should not interfere with your training.

Hey @boris, I am stuck with an issue and hope you might be able to resolve it. I am currently using transformers (in particular the Trainer API) to run sweeps using the CLI. The Trainer saves model checkpoints every save_steps under output_dir. Since my training procedure takes quite some time, I am wondering how I could resume preempted sweeps in the context of the Trainer API, i.e. by loading previous checkpoints from output_dir.

So far I have thought about something like this, but I am not sure it would work:

if wandb.run.resumed and any(Path(f'models\\checkpoints_{wandb.config.gradient_accumulation_steps}_{wandb.config.learning_rate}').iterdir()):
    trainer.train(resume_from_checkpoint=True)
else:
    trainer.train()

Hi @simonschoe
Are your models auto-logged as artifacts (with WANDB_LOG_MODEL)?
If so, you don’t even have to manage your models locally as you can easily redownload the best model from your sweep later.

No, not yet. I’d first like to make it work locally and eventually move over to storing checkpoints online along the way.

Ok, in case you want to do it, just use:
os.environ['WANDB_LOG_MODEL'] = 'true'

As for resuming from a checkpoint, I believe you need to use trainer.train(path) here.
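To make the "pass a path" idea concrete, here is a minimal sketch that picks the most recent checkpoint folder under output_dir. It assumes the Trainer's checkpoint-<step> folder naming; the helper name find_last_checkpoint is hypothetical and uses only the standard library. I believe transformers also ships a similar helper, get_last_checkpoint in transformers.trainer_utils, which would be the first thing to try.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: checkpoints are saved as "checkpoint-<step>" folders.
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def find_last_checkpoint(output_dir: str) -> Optional[str]:
    """Return the path of the highest-step checkpoint folder, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if child.is_dir() and match:
            # Keep the numeric step so "checkpoint-1000" sorts after "checkpoint-500".
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    return str(max(candidates, key=lambda item: item[0])[1])
```

The result could then be passed as trainer.train(resume_from_checkpoint=find_last_checkpoint(output_dir)); when no checkpoint exists the helper returns None, which should amount to a fresh training run.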

Thanks for your reply! Do you know if setting os.environ['WANDB_LOG_MODEL'] = 'true' also ensures that model checkpoints are uploaded/logged along the way? For example, I’d like to store model checkpoints every x steps. However, from the artifacts documentation I inferred that artifacts are generally stored after a successful training run, i.e. at the end. Is this correct?

At the moment it’s only at the end of training. However, I opened an issue to be able to log models continuously.

Hey guys, it seems that eval loss is not being logged to wandb (No loss being logged, when running MLM script (Colab) - #3 by Neel-Gupta)

It also looks like the code cell for pre-training just keeps running even after the model has been trained.
That is, it apparently never proceeds to wandb.finish() and keeps on going.

Any ideas on how I can solve these issues?

@boris any updates on @Neel-Gupta's point? The validation loss inside Trainer() isn’t logged to wandb.

Looks like his post was answered.
What matters is that wandb logs any metric that is produced by the Trainer, so you need to make sure you use the correct arguments for the Trainer (in particular the evaluation strategy and the evaluation interval).
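As a rough sketch of what such arguments could look like: the keyword names below are the TrainingArguments parameters as I know them from the transformers releases around the time of this thread (I believe newer releases rename evaluation_strategy to eval_strategy), and the path and step counts are purely illustrative assumptions. The dict is kept separate from transformers so it runs on its own.

```python
# Keyword arguments intended for transformers.TrainingArguments.
# All values are illustrative assumptions, not recommendations.
training_kwargs = {
    "output_dir": "output",          # hypothetical path
    "evaluation_strategy": "steps",  # evaluate periodically during training
    "eval_steps": 500,               # interval between evaluations, in steps
    "logging_steps": 100,            # interval between training-loss logs
    "report_to": "wandb",            # forward logged metrics to W&B
}

# Usage (requires transformers):
# from transformers import TrainingArguments
# args = TrainingArguments(**training_kwargs)
```

If evaluation is never scheduled, no eval loss is produced, so there is nothing for wandb to pick up.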

I have a trainer in which I’m overriding the evaluate method to inject some custom functionality. In particular, I’m computing some metrics “by hand” in the evaluate method, returning them in the appropriate Dict[str, float] object. However, these metrics aren’t being logged to W&B. Is this because they’re not being computed via the compute_metrics function typically passed to Trainer? I’m invoking W&B in the simplest way possible here, just passing report_to="wandb" in TrainingArguments.

You will need to log them manually to have them reported to wandb (with the log method of the Trainer).

If metrics: Dict[str, float] is the object I’m returning from evaluate, is it as simple as just invoking self.log(metrics)?

Update: It seems it is that simple! (Correct me if I’m wrong, @sgugger!)

You just have to do this indeed :)
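For anyone landing here later, a minimal sketch of the pattern discussed above. The factory make_logging_trainer, the class LoggingTrainer, the method compute_custom_metrics and the metric name eval_custom_score are all hypothetical names for illustration; the only Trainer API relied on is log, as confirmed above. The base class is passed in as an argument so the snippet can be run without transformers installed.

```python
from typing import Dict


def make_logging_trainer(base_trainer_cls):
    """Build a Trainer subclass whose evaluate() logs hand-computed metrics."""

    class LoggingTrainer(base_trainer_cls):
        def compute_custom_metrics(self) -> Dict[str, float]:
            # Hypothetical placeholder: replace with the real computation.
            return {"eval_custom_score": 0.0}

        def evaluate(self, *args, **kwargs) -> Dict[str, float]:
            metrics = self.compute_custom_metrics()
            # Without this call the metrics are only returned to the caller
            # and are never handed to the Trainer's logging.
            self.log(metrics)
            return metrics

    return LoggingTrainer
```

With transformers installed, this would be used as LoggingTrainer = make_logging_trainer(transformers.Trainer), and the resulting class constructed like a regular Trainer with report_to="wandb" in its TrainingArguments.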
