Logging & Experiment tracking with W&B

Will the wandb API run in a notebook without internet access?

What are you trying to do? You can run wandb offline. To do this:

  1. Set the environment variable WANDB_MODE=offline to save the metrics locally, no internet required.
  2. When you’re ready, run wandb init in your directory to set the project name.
  3. Run wandb sync YOUR_RUN_DIRECTORY to push the metrics to our cloud service and see your results in our hosted web app.
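The steps above can be sketched from Python as follows. This is a minimal sketch rather than an official recipe: `sync_command` is a hypothetical helper introduced here, and the run directory passed to it is a placeholder.

```python
import os
from typing import List

# Step 1: set this before wandb.init() runs (directly or through the Trainer),
# so metrics are saved locally instead of being uploaded.
os.environ["WANDB_MODE"] = "offline"


def sync_command(run_dir: str) -> List[str]:
    """Build the `wandb sync` call from step 3 (hypothetical helper)."""
    return ["wandb", "sync", run_dir]


# Later, on a machine with internet access, something like:
#   subprocess.run(sync_command("YOUR_RUN_DIRECTORY"), check=True)
```

Setting the variable inside the notebook keeps the choice next to the training code; exporting it in the shell before starting Jupyter should work equally well.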

If you want to query your data through the wandb API, you need internet access. See: Import & Export Data - Documentation

If you are hosting your own wandb server using wandb local, you can query it with access to the locally hosted server only. See: wandb.apis.public.Api - Documentation

I am fine-tuning multiple models in a for loop, as follows.

for file in os.listdir(args.data_dir):
    finetune(args, file)

But wandb shows logs only for the first file in data_dir, although training runs and models are saved for the other files too. This seems like very strange behavior.

wandb: Synced bertweet-base-finetuned-file1: https://wandb.ai/***/huggingface/runs/***

This is a small snippet of the fine-tuning code with Hugging Face:

def finetune(args, file):
    training_args = TrainingArguments(
        output_dir=f'{model_name}-finetuned-{file}',
        overwrite_output_dir=True,
        evaluation_strategy='no',
        num_train_epochs=args.epochs,
        learning_rate=args.lr,
        weight_decay=args.decay,
        per_device_train_batch_size=args.batch_size,
        per_device_eval_batch_size=args.batch_size,
        fp16=True, # mixed-precision training to boost speed
        save_strategy='no',
        seed=args.seed,
        dataloader_num_workers=4,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset['train'],
        eval_dataset=None,
        data_collator=data_collator,
    )
    trainer.train()
    trainer.save_model()

This is a duplicate of this post: Wandb for Huggingface Trainer saves only first model - W&B Help - W&B Community in the W&B forum

Here’s the reply:

You’ve set save_strategy to "no" in your code, which avoids saving anything during training. Only the final model is saved, once training is done, by trainer.save_model(). You can change it to save_strategy="epoch" and the model will be saved after every epoch.

Or, in order to log models, you could also set the env var WANDB_LOG_MODEL as specified in our docs here. Once you set this env var, any Trainer you initialize from now on will upload models to your W&B project. Note that your model will be saved to W&B Artifacts as run-{run_name}.
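As a minimal sketch of the environment-variable route, the variables can be set from Python before the Trainer is constructed. The value "end" (upload the final model) is an assumption about the installed transformers version, since accepted values have changed between releases (older ones used "true"), and the project name is a placeholder.

```python
import os

# Set these before constructing the Trainer so the W&B integration sees them.
# Assumption: "end" asks the integration to upload the final model.
os.environ["WANDB_LOG_MODEL"] = "end"
# Placeholder project name; WANDB_PROJECT selects the W&B project to log to.
os.environ["WANDB_PROJECT"] = "my-finetuning-project"
```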

Calling wandb.init(reinit=True) at the start of each fine-tuning run and run.finish() at the end let me log each model as a separate run on the wandb website.

The working code looks like this:

import os
import wandb

def finetune(args, file):
    run = wandb.init(reinit=True)  # start a fresh run for this file
    ...
    run.finish()  # close the run so the next file gets its own

for file in os.listdir(args.data_dir):
    finetune(args, file)

Reference: Launch Experiments with wandb.init - Documentation

Thanks for sharing an update @krishnagarg09

Hi! Hope this is the right place to ask this…

For whatever reason, in my environment, when running the run_summarization.py script I get the following error:

wandb.errors.UsageError: Error communicating with wandb process
try: wandb.init(settings=wandb.Settings(start_method='fork'))
or:  wandb.init(settings=wandb.Settings(start_method='thread'))
For more info see: https://docs.wandb.ai/library/init#init-start-error

Adding settings=wandb.Settings(start_method='fork') to wandb.init does seem to fix the problem for me. Is there a way to specify this as an argument to scripts like run_summarization.py? (I want to avoid modifying the script if possible.)
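One possible way to avoid editing the script is the WANDB_START_METHOD environment variable. This is a hedged sketch that assumes the installed wandb version still reads this variable; it is not confirmed anywhere in this thread.

```python
import os

# Assumption: the installed wandb version honors WANDB_START_METHOD.
# Set it before the script calls wandb.init(); "thread" is the other value
# suggested by the error message.
os.environ["WANDB_START_METHOD"] = "fork"
```

The same variable can be exported in the shell before launching run_summarization.py, which leaves the script itself untouched.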

Hey @johngiorgi, I work at Weights & Biases, glad you got a fix working. I’m curious: are you training across multiple machines? Or is there anything unusual in your system setup that might prevent wandb from communicating with the server?

Hi @morgan! I am not training across multiple machines (in fact I’m not even training across multiple GPUs for the time being). I don’t think it has to do with my system or environment. I am running on the ARC clusters and following their minimal example can get W&B to work fine. I only get problems when I try to use the example run_summarization.py script from HF (I haven’t tried other run_*.py scripts but sort of expect the same issue).

Is there an official answer to this? What about just using the callback?

See the report_to option:

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",  # todo change
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    fp16=fp16,
    report_to=report_to,
)

Yep passing “wandb” should work well

report_to="wandb"

Should work when using the Trainer

Also make sure to update to wandb 0.13 (just released last week), as it massively improves performance and support for distributed training.

pip install wandb --upgrade

@morgan now I have the issue that in distributed training all 8 workers are logging, so I get 8 runs in my wandb. Is the issue resolved only by having the main process init wandb? How do I know which is the main process if I am only using the Trainer API?


Solution: Is wandb in Trainer configured for distributed training? - #3 by brando

Here is some documentation about logging within a distributed training setting: Distributed Training - Documentation

Summary:

  1. Log on the main process and get one run
  2. Log on all processes and get one run per process; use the group param to group these runs
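The two options can be sketched as a small helper that decides what to pass to wandb.init. This is an illustration only: `wandb_init_kwargs` is a hypothetical function, and it assumes the launcher (for example torchrun) sets the RANK environment variable.

```python
import os
from typing import Any, Dict, Optional


def wandb_init_kwargs(group: str, log_all_processes: bool = False) -> Optional[Dict[str, Any]]:
    """Return kwargs for wandb.init(), or None if this process should not log.

    Assumes the launcher sets RANK; a missing RANK is treated as a
    single-process run (rank 0).
    """
    rank = int(os.environ.get("RANK", "0"))
    if log_all_processes:
        # Option 2: one run per process, grouped together via `group`.
        return {"group": group, "name": f"rank-{rank}"}
    # Option 1: only the main process (rank 0) creates a run.
    return {"name": group} if rank == 0 else None
```

A process would then call `wandb.init(**kwargs)` only when the returned value is not None.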

I’m a bit out of my depth in how to tell HF to log only on the main process. A few things to try:

  1. log_on_each_node in TrainingArguments seems to do what you want
  2. The WandbCallback already checks if it’s on the main process before logging: transformers/integrations.py at main · huggingface/transformers · GitHub
    so you can use that.
  3. If you want to customise logging, you can define your own on_log method within a TrainerCallback and use the state param to determine which process it’s on, like in the WandbCallback linked above.

I forgot to init wandb for my (long-ish) train run and would like to be able to export the metrics I have computed to my project. Is there a simple way to export all the metrics from my transformers training run that are dumped in checkpoint-*/trainer_state.json to wandb?
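One way this could be done, sketched under assumptions: trainer_state.json stores the logged metrics under a "log_history" list whose entries carry a "step" key, so they can be replayed into a new run. Both function names below are hypothetical, and `replay_to_wandb` needs wandb installed and a logged-in session.

```python
import json
from typing import Any, Dict, Iterator, Tuple


def iter_logged_metrics(state_path: str) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (step, metrics) pairs from a trainer_state.json file."""
    with open(state_path) as f:
        state = json.load(f)
    for entry in state.get("log_history", []):
        if "step" not in entry:
            continue  # skip entries that cannot be placed on the step axis
        metrics = {k: v for k, v in entry.items() if k != "step"}
        yield entry["step"], metrics


def replay_to_wandb(state_path: str, project: str) -> None:
    """Replay the stored metrics into a new W&B run."""
    import wandb  # imported lazily so the parser above works without wandb

    run = wandb.init(project=project)
    for step, metrics in iter_logged_metrics(state_path):
        run.log(metrics, step=step)
    run.finish()
```

Pointing this at the trainer_state.json of the most recent checkpoint should be enough, since each checkpoint stores the log history up to that point.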

@sgugger how can I use wandb for fine-tuning the Segment Anything model? The tutorials are all about NLP, and there are none for computer vision…
@boris

Hi everyone,
has anyone had this kind of error?
I’m just using wandb.login() and report_to="wandb" in TrainingArguments.

Thanks for this awesome integration. I have a question about customizing the logging variables.

Using Trainer, suppose forward computes two losses, loss_1 and loss_2, and the model loss is their sum. How can I log the two parts separately, along with the default (total) loss, at each logging step?

For example, forward returns a dictionary of loss, loss_1 and loss_2.

Hello.

I don’t know whether anyone has tried running this script to resume training from a checkpoint (checkpoint-script), but I think the argument passed to run.use_artifact() should be my_checkpoint_name instead of my_model_name as suggested in the docs (wandb-docs).

import os
import wandb

last_run_id = "xxxxxxxx"  # fetch the run_id from your wandb workspace

# resume the wandb run from the run_id
with wandb.init(
    project=os.environ["WANDB_PROJECT"],
    id=last_run_id,
    resume="must",
) as run:
    # Connect an Artifact to the run
    my_checkpoint_name = f"checkpoint-{last_run_id}:latest"
    my_checkpoint_artifact = run.use_artifact(my_checkpoint_name) # should not be my_model_name
    .
    .
    .