Logging & Experiment tracking with W&B

pchhapolika · January 7, 2022, 5:48am

Will the wandb API run in a notebook without internet access?

scottire · January 10, 2022, 12:04pm

What are you trying to do? You can run wandb offline. To do this:

Set the environment variable WANDB_MODE=offline to save the metrics locally, no internet required.
When you’re ready, run wandb init in your directory to set the project name.
Run wandb sync YOUR_RUN_DIRECTORY to push the metrics to our cloud service and see your results in our hosted web app.
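As a minimal sketch of the first step, the environment variable can also be set from Python, as long as that happens before wandb.init is called. The helper names and the project name "offline-demo" below are made up for illustration. wandb is imported lazily so the first helper has no dependency on it.

```python
import os


def enable_wandb_offline(env=os.environ):
    """Set WANDB_MODE=offline; call this before wandb.init()."""
    env["WANDB_MODE"] = "offline"
    return env


def log_offline_example(project="offline-demo"):
    """Log one metric to a local run directory without network access."""
    import wandb  # imported here so the helper above works without wandb installed

    enable_wandb_offline()
    run = wandb.init(project=project)  # "offline-demo" is a placeholder name
    run.log({"loss": 0.5})
    run.finish()  # the local run directory can later be pushed with `wandb sync`
```

Calling log_offline_example() in a notebook cell should write the run to a local wandb directory, which can be synced later from a machine that has internet access.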

If you want to query your data through the wandb API, you need internet access. See: Import & Export Data - Documentation

If you are hosting your own wandb server with wandb local, you only need access to that locally hosted server to query it. See: wandb.apis.public.Api - Documentation
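For reference, a query through the public API might look roughly like the sketch below. The entity and project names are placeholders, the helper names are invented for this example, and the call needs a reachable wandb server (hosted or local) plus a configured API key.

```python
def run_path(entity, project):
    """Build the "entity/project" path that the public API expects."""
    return f"{entity}/{project}"


def list_runs(entity, project):
    """Return (id, name, state) for every run in a project."""
    import wandb  # lazy import: only needed when the server is actually queried

    api = wandb.Api()
    return [(run.id, run.name, run.state) for run in api.runs(run_path(entity, project))]
```

For example, list_runs("my-team", "huggingface") would list the runs of a hypothetical "my-team" entity.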

krishnagarg09 · April 19, 2022, 2:49pm

I am finetuning multiple models using a for loop as follows.

for file in os.listdir(args.data_dir):
    finetune(args, file)

BUT wandb shows logs only for the first file in data_dir, although it is training and saving models for the other files. This seems like very strange behavior.

wandb: Synced bertweet-base-finetuned-file1: https://wandb.ai/***/huggingface/runs/***

This is a small snippet of finetuning code with Huggingface:

def finetune(args, file):
    training_args = TrainingArguments(
        output_dir=f'{model_name}-finetuned-{file}',
        overwrite_output_dir=True,
        evaluation_strategy='no',
        num_train_epochs=args.epochs,
        learning_rate=args.lr,
        weight_decay=args.decay,
        per_device_train_batch_size=args.batch_size,
        per_device_eval_batch_size=args.batch_size,
        fp16=True, # mixed-precision training to boost speed
        save_strategy='no',
        seed=args.seed,
        dataloader_num_workers=4,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset['train'],
        eval_dataset=None,
        data_collator=data_collator,
    )
    trainer.train()
    trainer.save_model()

scottire · April 21, 2022, 9:43am

This is a duplicate of this post: Wandb for Huggingface Trainer saves only first model - W&B Help - W&B Community in the W&B forum

Here’s the reply:

You’ve set save_strategy to 'no' in your code, which avoids saving any checkpoints. This only saves the final model once training is done, via trainer.save_model(). You can change it to save_strategy="epoch" and it will save the model after every epoch.

Or, in order to log models, you could also set the env var WANDB_LOG_MODEL as specified in our docs here. Once you set this env var, any Trainer you initialize from now on will upload models to your W&B project. Note that your model will be saved to W&B Artifacts as run-{run_name} .
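A rough sketch of both suggestions together is shown below. The value "end" is an assumption: recent transformers versions document "end" and "checkpoint" for WANDB_LOG_MODEL, while older versions used "true", so check the docs for your installed version. The output_dir is a placeholder.

```python
import os

# Assumption: "end" uploads the final model on recent transformers versions;
# older versions documented the value "true" instead.
os.environ["WANDB_LOG_MODEL"] = "end"

# Keyword arguments for TrainingArguments; output_dir is a placeholder.
training_kwargs = {
    "output_dir": "bertweet-base-finetuned",
    "save_strategy": "epoch",  # save a checkpoint after every epoch
    "report_to": "wandb",
}

# With transformers installed, these would be passed on as:
# from transformers import TrainingArguments
# training_args = TrainingArguments(**training_kwargs)
```

The environment variable has to be set before the Trainer is created, since the W&B integration reads it during setup.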

krishnagarg09 · April 21, 2022, 2:50pm

wandb.init(reinit=True) and run.finish() helped me to log the models separately on wandb website.

The working code looks like below:


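import os
import wandb

def finetune(args, file):
    # Start a fresh W&B run for each file so every model is logged separately
    run = wandb.init(reinit=True)
    ...
    run.finish()  # close this run before the next call starts a new one

for file in os.listdir(args.data_dir):
    finetune(args, file)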

Reference: Launch Experiments with wandb.init - Documentation
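A variation on the same idea, as a sketch: the run object returned by wandb.init can be used as a context manager, so the run is finished even if training raises. The function name finetune_all, the train_one callback, and the "finetune-{file}" run names are made up for illustration.

```python
def finetune_all(files, train_one, init_run=None):
    """Train once per file, each inside its own W&B run.

    train_one(file, run) is a user-supplied training function (hypothetical).
    init_run defaults to wandb.init and is injectable for testing.
    """
    if init_run is None:
        import wandb  # lazy import so the function can be tested without wandb

        init_run = wandb.init

    finished = []
    for file in files:
        # Leaving the `with` block finishes the run, even when train_one raises.
        with init_run(reinit=True, name=f"finetune-{file}") as run:
            train_one(file, run)
        finished.append(file)
    return finished
```

With this layout, each file should show up as its own run on the wandb website, named after the file.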

scottire · April 21, 2022, 2:51pm

Thanks for sharing an update @krishnagarg09

johngiorgi · May 10, 2022, 11:44pm

Hi! Hope this is the right place to ask this…

For whatever reason, in my environment, when running the run_summarization.py script I get the following error:

wandb.errors.UsageError: Error communicating with wandb process
try: wandb.init(settings=wandb.Settings(start_method='fork'))
or:  wandb.init(settings=wandb.Settings(start_method='thread'))
For more info see: https://docs.wandb.ai/library/init#init-start-error

Adding settings=wandb.Settings(start_method='fork') to wandb.init does seem to fix the problem for me. Is there a way to specify this as an argument to scripts like run_summarization.py? (I want to avoid modifying the script if possible.)
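One option that may avoid editing the script is to pass the start method through the environment. The sketch below assumes that wandb reads WANDB_START_METHOD as the environment equivalent of Settings(start_method=...); this is not confirmed anywhere in this thread, so verify it against your wandb version. The helper names are invented for the example.

```python
import os
import subprocess
import sys


def env_with_start_method(start_method="thread", base_env=None):
    """Copy the environment and add WANDB_START_METHOD (assumed to be read by wandb)."""
    env = dict(os.environ if base_env is None else base_env)
    env["WANDB_START_METHOD"] = start_method
    return env


def run_script(script_args, start_method="thread"):
    """Run an unmodified Python script with the start method set via the environment."""
    return subprocess.run(
        [sys.executable, *script_args],
        env=env_with_start_method(start_method),
        check=True,
    )
```

For example, run_script(["run_summarization.py", "--report_to", "wandb"], start_method="fork") would launch the script unchanged. Exporting the variable in the shell before launching should have the same effect.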

morgan · May 12, 2022, 10:09am

Hey @johngiorgi, I work at Weights & Biases, glad you got a fix working. I’m curious, are you training across multiple machines? Or is there anything unusual in your system setup that might prevent wandb from communicating with the server?

johngiorgi · May 18, 2022, 10:01pm

Hi @morgan! I am not training across multiple machines (in fact I’m not even training across multiple GPUs for the time being). I don’t think it has to do with my system or environment. I am running on the ARC clusters and following their minimal example can get W&B to work fine. I only get problems when I try to use the example run_summarization.py script from HF (I haven’t tried other run_*.py scripts but sort of expect the same issue).

brando · August 9, 2022, 9:10pm

Is there an official answer to this? What about just using the callback?

brando · August 12, 2022, 4:30pm

See the report_to option:

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",  # todo change
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    fp16=fp16,
    report_to=report_to,
)

morgan · August 16, 2022, 1:55pm

Yep passing “wandb” should work well

report_to="wandb"

Should work when using the Trainer

Also make sure to update to wandb 0.13 (just released last week), as it massively improves performance and support for distributed training.

pip install wandb --upgrade
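If the same script has to run on machines with and without wandb installed, the report_to value could be chosen at runtime. This is only a sketch; the helper names are invented, and "none" is the TrainingArguments value that turns reporting integrations off.

```python
def wandb_available():
    """Return True only if the real wandb package can be imported."""
    try:
        import wandb
    except ImportError:
        return False
    # A stray local ./wandb run directory can import as an empty namespace
    # package, so also check for an attribute the real package has.
    return hasattr(wandb, "init")


def pick_report_to():
    """Choose the report_to value for TrainingArguments."""
    return "wandb" if wandb_available() else "none"


training_kwargs = {"output_dir": "./results", "report_to": pick_report_to()}
```

The resulting dictionary can then be unpacked into TrainingArguments or Seq2SeqTrainingArguments.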

brando · August 18, 2022, 3:03pm

@morgan now I am having the issue that in distributed training all 8 workers are logging, so I get 8 runs in my wandb. Is the issue resolved only by having the main process init wandb? How do I know which is the main wandb process if I am using only the Trainer API?

solution: Is wandb in Trainer configured for distributed training? - #3 by brando

scottire · August 25, 2022, 10:03am

Here is some documentation about logging within a distributed training setting: Distributed Training - Documentation

Summary:

Log on the main process and get one run
Log on all processes and get one run per process; use the group param to group these runs
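The two options in the summary can be sketched as a helper that picks wandb.init arguments per process. It assumes the launcher exports RANK (torchrun does); the function name and the run naming scheme are made up for illustration.

```python
import os


def wandb_init_kwargs(group, log_all_ranks=False, env=os.environ):
    """Pick wandb.init keyword arguments for the current process.

    Assumes the launcher sets RANK; a missing RANK is treated as rank 0.
    """
    rank = int(env.get("RANK", "0"))
    if log_all_ranks:
        # One run per process, grouped together in the W&B UI.
        return {"group": group, "name": f"{group}-rank{rank}"}
    if rank == 0:
        # Only the main process gets a real run.
        return {"name": group}
    # Other processes get a no-op run, so later logging calls stay harmless.
    return {"mode": "disabled"}
```

Each process would then call wandb.init(**wandb_init_kwargs("my-experiment")), where "my-experiment" is a placeholder.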

I’m a bit out of my depth in how to tell HF to log only on the main process. A few things to try:

log_on_each_node in TrainingArguments seems to do what you want
The WandbCallback already checks if it’s on the main process before logging: transformers/integrations.py at main · huggingface/transformers · GitHub
so you can use that.
If you want to customise logging, you can define your own on_log method within a TrainerCallback and use the state param to determine which process it’s on, like in the WandbCallback linked above.
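A minimal sketch of that last suggestion follows. The class and helper names are invented for this example, and log_fn stands in for whatever logging call you want to make (for instance wandb.log). transformers is imported lazily inside the factory.

```python
def is_main_process(state):
    """True when the TrainerState says this is the global main process."""
    return bool(getattr(state, "is_world_process_zero", True))


def make_main_process_logger(log_fn=print):
    """Build a TrainerCallback that forwards logs from the main process only."""
    from transformers import TrainerCallback  # lazy import

    class MainProcessLogger(TrainerCallback):
        def on_log(self, args, state, control, logs=None, **kwargs):
            # Skip empty logs and all non-main processes.
            if logs and is_main_process(state):
                log_fn({"step": state.global_step, **logs})

    return MainProcessLogger()
```

The callback would be registered with Trainer(..., callbacks=[make_main_process_logger()]).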

rajhansneeva · September 3, 2022, 8:40pm

I forgot to init wandb for my (long-ish) train run and would like to be able to export the metrics I have computed to my project. Is there a simple way to export all the metrics from my transformers training run that are dumped in checkpoint-*/trainer_state.json to wandb?
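One possible approach, as a sketch: trainer_state.json stores the logged metrics under a log_history list, where each entry carries a step, so the entries can be replayed into a new run. The helper names are made up, and log_fn is expected to accept the same arguments as run.log.

```python
import json


def load_log_history(path):
    """Read the log_history list from a trainer_state.json file."""
    with open(path) as f:
        return json.load(f).get("log_history", [])


def replay_log_history(log_history, log_fn):
    """Send each logged entry to log_fn(metrics, step=step), in step order."""
    count = 0
    for entry in sorted(log_history, key=lambda e: e.get("step", 0)):
        step = entry.get("step")
        metrics = {k: v for k, v in entry.items() if k != "step"}
        if step is None or not metrics:
            continue  # nothing usable in this entry
        log_fn(metrics, step=step)
        count += 1
    return count
```

With wandb installed, this could be used as run = wandb.init(project="my-project"), then replay_log_history(load_log_history("checkpoint-500/trainer_state.json"), run.log), then run.finish(). The project name and checkpoint path are placeholders. The latest checkpoint should hold the full history, since log_history accumulates over training.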

sushmanth · July 13, 2023, 12:38pm

@sgugger how do I use wandb for fine-tuning the Segment Anything model? The tutorials are about NLP, but there are no tutorials for computer vision…
@boris

RaphaelKalandadze · July 31, 2023, 3:26pm

Hi everyone
has anyone had this kind of error?
I’m just using wandb.login() and report_to="wandb" in TrainingArguments

hzhiqi · July 31, 2023, 10:57pm

Thanks for this awesome integration. I have a question about customizing the logging variables.

Using Trainer, suppose forward computes two losses: loss_1, and loss_2, and the model loss is the sum of loss_1 and loss_2, how can I log two parts separately along with the default (total) loss at the logging step?

For example, forward returns a dictionary of loss, loss_1 and loss_2.
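One way this might be done, offered as a sketch and not as the official method: subclass Trainer, override compute_loss, and pass the partial losses to self.log. The names loss_parts and MultiLossTrainer are invented here, and the code assumes forward returns a dict-like output with the keys loss, loss_1 and loss_2 as described above.

```python
def loss_parts(outputs, keys=("loss_1", "loss_2")):
    """Pull the partial losses out of a dict-like model output as plain floats."""
    return {k: float(outputs[k]) for k in keys if k in outputs}


def make_multi_loss_trainer():
    """Build a Trainer subclass that also logs loss_1 and loss_2."""
    from transformers import Trainer  # lazy import

    class MultiLossTrainer(Trainer):
        # **kwargs absorbs extra arguments that newer versions pass in.
        def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
            outputs = model(**inputs)
            parts = loss_parts(outputs)
            if parts:
                # Caveat: this logs on every call, not only at logging steps.
                self.log(parts)
            loss = outputs["loss"]
            return (loss, outputs) if return_outputs else loss

    return MultiLossTrainer
```

Because self.log runs on every compute_loss call, the extra values are logged more often than the default loss. Throttling on self.state.global_step may be needed if that is too noisy.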

raunak45 · February 28, 2024, 6:46am

Hello.

I don’t know if someone tried to run this script or not to resume training from a checkpoint (checkpoint-script), but I think the argument passed to run.use_artifact() should be my_checkpoint_name instead of my_model_name as suggested in the docs. (wandb-docs)

import os
import wandb

last_run_id = "xxxxxxxx"  # fetch the run_id from your wandb workspace

# resume the wandb run from the run_id
with wandb.init(
    project=os.environ["WANDB_PROJECT"],
    id=last_run_id,
    resume="must",
) as run:
    # Connect an Artifact to the run
    my_checkpoint_name = f"checkpoint-{last_run_id}:latest"
    my_checkpoint_artifact = run.use_artifact(my_checkpoint_name) # should not be my_model_name
    .
    .
    .
