Also, I was wondering whether it is a sync operation: on a slow connection it can take quite some time to upload 1 GB to the cloud.
It would also be nice to have a progress bar. I had a cell with a total running time of 6 hours: the model trained in 2, and the rest was wandb finish. In the end I stopped the cell, and wandb detected the 'ctrl-c' and reported that the data upload was interrupted, but I think my data was already on the cloud and finish just got stuck somewhere.
Hey @boris, I am stuck with an issue and hope you might be able to resolve it. I am currently using transformers (in particular the Trainer API) to run sweeps via the CLI. The Trainer saves model checkpoints every save_steps under output_dir. Since my training procedure takes quite some time, I am wondering how I would be able to resume preempted sweeps in the context of the Trainer API, i.e. by loading previous checkpoints from output_dir.
Currently, I thought about something like this, however, I am not sure if it would work:
from pathlib import Path  # needed for the checkpoint-directory check

# forward slashes work on Windows too when using pathlib
ckpt_dir = Path(f'models/checkpoints_{wandb.config.gradient_accumulation_steps}_{wandb.config.learning_rate}')
if wandb.run.resumed and ckpt_dir.exists() and any(ckpt_dir.iterdir()):
    trainer.train(resume_from_checkpoint=True)
else:
    trainer.train()
Hi @simonschoe
Are your models auto-logged as artifacts (with WANDB_LOG_MODEL)?
If so, you don't even have to manage your models locally, as you can easily redownload the best model from your sweep later.
Thanks for your reply! Do you know if the setting os.environ['WANDB_LOG_MODEL'] = 'true' also ensures that model checkpoints are uploaded/logged along the way? For example, I'd like to store model checkpoints every x steps. However, from the artifacts documentation I inferred that artifacts are generally stored after a successful training run, i.e. at the end. Is this correct?
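For reference, this is roughly how I set it (a minimal sketch; I set the variable before creating the Trainer so the callback can pick it up at setup):

```python
import os

# Assumption under discussion: with this set before the Trainer is created,
# checkpoints written every save_steps would also be logged as artifacts,
# not just the final model.
os.environ['WANDB_LOG_MODEL'] = 'true'
```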
And it looks like when the code cell for pre-training has run, it just keeps running even after the model has been trained. That is, it apparently never gets past wandb.finish() and keeps on going.
Looks like his post was answered.
What matters is that wandb logs any metric that is produced by the Trainer, so you need to make sure you pass the correct arguments to the Trainer (in particular the evaluation strategy and the evaluation interval).
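For example, something along these lines (a sketch only; the directory and step values are illustrative, not a recommendation):

```python
from transformers import TrainingArguments

# Ask the Trainer to run evaluation on a fixed step interval so that there
# are eval metrics for wandb to pick up, and route logging to W&B.
args = TrainingArguments(
    output_dir="out",                 # illustrative path
    evaluation_strategy="steps",      # evaluate every eval_steps
    eval_steps=500,                   # illustrative interval
    logging_steps=100,                # how often training metrics are logged
    report_to="wandb",
)
```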
I have a trainer in which I'm overriding the evaluate method to inject some custom functionality. In particular, I'm computing some metrics "by hand" in the evaluate method, returning them in the appropriate Dict[str, float] object. However, these metrics aren't being logged to W&B. Is this because they're not being computed via the compute_metrics function typically passed to Trainer? I'm invoking W&B in the simplest way possible here, just passing report_to="wandb" in TrainingArguments.
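My current hypothesis is that metrics only reach the reporting callbacks (such as the W&B one) when they pass through self.log(), so simply returning them from evaluate would not be enough. A minimal stand-in sketching that flow (MiniTrainer and the metric names are hypothetical, not the real transformers API):

```python
# Hypothetical stand-in for the Trainer's logging flow: callbacks only see
# what goes through self.log(), never the raw return value of evaluate().

class MiniTrainer:
    def __init__(self):
        self.logged = []  # what reporting callbacks would receive

    def log(self, metrics):
        # in the real Trainer, this fans out to report_to targets
        self.logged.append(dict(metrics))

    def evaluate(self):
        metrics = {"eval_loss": 0.5}
        self.log(metrics)  # the built-in evaluate logs before returning
        return metrics


class CustomTrainer(MiniTrainer):
    def evaluate(self):
        metrics = super().evaluate()
        extra = {"eval_my_metric": 0.9}  # computed "by hand"
        metrics.update(extra)
        self.log(extra)  # without this call, the callback never sees `extra`
        return metrics
```

If that hypothesis is right, calling self.log(...) on the hand-computed metrics inside the overridden evaluate should make them show up in W&B.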