Will the wandb API run in a notebook without internet access?
What are you trying to do? You can run wandb offline. To do this:
- Set the environment variable WANDB_MODE=offline to save the metrics locally, no internet required.
- When you're ready, run wandb init in your directory to set the project name.
- Run wandb sync YOUR_RUN_DIRECTORY to push the metrics to our cloud service and see your results in our hosted web app. (A notebook sketch of this workflow follows.)
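A minimal sketch of the offline workflow from inside a notebook (the project name and logged metric are placeholders):

import os
os.environ["WANDB_MODE"] = "offline"  # must be set before the run starts

import wandb

run = wandb.init(project="my-project")  # placeholder project name
run.log({"loss": 0.5})                  # written locally under ./wandb, no internet needed
run.finish()
# later, from a machine with internet access:
#   wandb sync wandb/offline-run-*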
If you are looking to query wandb using the wandb API, you would need internet access to query your data. See Import & Export Data - Documentation.
If you are hosting your own wandb instance locally using wandb local, you could query it with access to the locally hosted server only. See wandb.apis.public.Api - Documentation.
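For reference, a minimal sketch of querying runs with the public API (the "entity/project" path is a placeholder; against a local instance this goes through your locally hosted server):

import wandb

# query runs via the public API; "entity/project" is a placeholder path
api = wandb.Api()
runs = api.runs("entity/project")
for run in runs:
    print(run.name, run.state)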
I am finetuning multiple models using a for loop, as follows:

for file in os.listdir(args.data_dir):
    finetune(args, file)

BUT wandb shows logs only for the first file in data_dir, although it is training and saving models for the other files. This feels like very strange behavior.
wandb: Synced bertweet-base-finetuned-file1: https://wandb.ai/***/huggingface/runs/***
This is a small snippet of the finetuning code with Hugging Face:

def finetune(args, file):
    training_args = TrainingArguments(
        output_dir=f'{model_name}-finetuned-{file}',
        overwrite_output_dir=True,
        evaluation_strategy='no',
        num_train_epochs=args.epochs,
        learning_rate=args.lr,
        weight_decay=args.decay,
        per_device_train_batch_size=args.batch_size,
        per_device_eval_batch_size=args.batch_size,
        fp16=True,  # mixed-precision training to boost speed
        save_strategy='no',
        seed=args.seed,
        dataloader_num_workers=4,
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_dataset['train'],
        eval_dataset=None,
        data_collator=data_collator,
    )
    trainer.train()
    trainer.save_model()
This is a duplicate of this post in the W&B forum: Wandb for Huggingface Trainer saves only first model - W&B Help - W&B Community
Here's the reply:
You've set save_strategy to 'no' in your code to avoid saving anything. This only saves the final model once training is done, with trainer.save_model(). You can update it to save_strategy="epoch" and it will save the model every epoch.
Or, in order to log models, you could also set the env var WANDB_LOG_MODEL as specified in our docs here. Once you set this env var, any Trainer you initialize from now on will upload models to your W&B project. Note that your model will be saved to W&B Artifacts as run-{run_name}.
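A minimal sketch of setting that env var from Python (assuming a recent transformers version, where "end" and "checkpoint" are accepted values; older versions used "true"):

import os

# must be set before the Trainer is created so the WandbCallback picks it up
os.environ["WANDB_LOG_MODEL"] = "end"  # or "checkpoint" to upload every checkpoint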
wandb.init(reinit=True) and run.finish() helped me to log the models separately on the wandb website.
The working code looks like this:
import wandb

def finetune(args, file):
    run = wandb.init(reinit=True)
    ...
    run.finish()

for file in os.listdir(args.data_dir):
    finetune(args, file)
Reference: Launch Experiments with wandb.init - Documentation
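If you also want each run to have a distinguishable name on the website, a small variation (the project and name values here are illustrative assumptions, not from the original post):

import os
import wandb

def finetune(args, file):
    run = wandb.init(
        project="huggingface",        # assumed project name
        name=f"finetuned-{file}",     # illustrative per-file run name
        reinit=True,                  # allow multiple runs in one process
    )
    ...
    run.finish()

for file in os.listdir(args.data_dir):
    finetune(args, file)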
Thanks for sharing an update, @krishnagarg09
Hi! Hope this is the right place to ask this…
For whatever reason, in my environment, when running the run_summarization.py
script I get the following error:
wandb.errors.UsageError: Error communicating with wandb process
try: wandb.init(settings=wandb.Settings(start_method='fork'))
or: wandb.init(settings=wandb.Settings(start_method='thread'))
For more info see: https://docs.wandb.ai/library/init#init-start-error
Adding settings=wandb.Settings(start_method='fork') to wandb.init does seem to fix the problem for me. Is there a way to specify this as an argument to scripts like run_summarization.py? (I want to avoid modifying the script if possible.)
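If your wandb version honors the WANDB_START_METHOD environment variable (it is listed among W&B's documented environment variables), one way to avoid editing the script is to set it before launching; a sketch:

import os

# set before wandb initializes, e.g. in the launching process;
# equivalently: export WANDB_START_METHOD=fork in the shell before running the script
os.environ["WANDB_START_METHOD"] = "fork"  # or "thread"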
Hey @johngiorgi, I work at Weights & Biases; glad you got a fix working. I'm curious: are you training across multiple machines? Or is there anything unusual in your system setup that might prevent wandb from communicating with the server?
Hi @morgan! I am not training across multiple machines (in fact, I'm not even training across multiple GPUs for the time being). I don't think it has to do with my system or environment. I am running on the ARC clusters, and following their minimal example I can get W&B to work fine. I only get problems when I try to use the example run_summarization.py script from HF (I haven't tried other run_*.py scripts, but I sort of expect the same issue).
Is there an official answer to this? What about just using the callback? See the report_to option:
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",  # todo: change
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    fp16=fp16,
    report_to=report_to,
)
Yep, passing "wandb" should work well:
report_to="wandb"
should work when using the Trainer.
Also make sure to update to wandb 0.13 (just released last week), as it massively improves performance and support for distributed training:
pip install wandb --upgrade
@morgan, now I am having the issue that in distributed training all 8 workers are logging; I get 8 runs in my wandb. Is the issue resolved only by having the main process init wandb? How do I know which is the main process if I am using only the Trainer API?
solution: Is wandb in Trainer configured for distributed training? - #3 by brando
Here is some documentation about logging within a distributed training setting: Distributed Training - Documentation
Summary:
- Log on the main process and get one run.
- Log on all processes and get one run per process; use the group param to group these runs (see the sketch after this list).

I'm a bit out of my depth on how to tell HF to log only on the main process. A few things to try:
- log_on_each_node in TrainingArguments seems to do what you want.
- The WandbCallback already checks if it's on the main process before logging (transformers/integrations.py at main · huggingface/transformers · GitHub), so you can use that.
- If you want to customise logging, you can define your own on_log method within a TrainerCallback and use the state param to determine which process it's on, like in the WandbCallback linked above.
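A minimal sketch of that grouping option (the project name, group name, and use of the RANK environment variable are assumptions; torchrun and torch.distributed launchers typically set RANK per process):

import os
import wandb

# one run per process, grouped together in the W&B UI
rank = int(os.environ.get("RANK", 0))  # set by torchrun/torch.distributed
run = wandb.init(
    project="my-project",        # assumed project name
    group="ddp-experiment-1",    # all processes share this group
    name=f"rank-{rank}",         # distinguishable per-process run name
)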
I forgot to init wandb for my (long-ish) training run and would like to export the metrics I have computed to my project. Is there a simple way to export all the metrics from my transformers training run that are dumped in checkpoint-*/trainer_state.json to wandb?
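One possible approach, a minimal sketch: replay the log_history list stored in trainer_state.json into a fresh W&B run (the checkpoint path and project name below are placeholders):

import json
import wandb

# load the saved Trainer state from one of the checkpoints
with open("checkpoint-1000/trainer_state.json") as f:  # placeholder path
    state = json.load(f)

run = wandb.init(project="my-project")  # placeholder project name
for entry in state["log_history"]:      # one dict of metrics per logging step
    step = entry.pop("step", None)
    run.log(entry, step=step)
run.finish()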
@sgugger how to use wandb for fine-tuning the Segment Anything Model? The tutorials are about NLP, but for computer vision there are no tutorials…
@boris
Hi everyone,
Has anyone had this kind of error?
I'm just using wandb.login() and report_to="wandb" in TrainingArguments.
Thanks for this awesome integration. I have a question about customizing the logged variables.
Using Trainer, suppose forward computes two losses, loss_1 and loss_2, and the model loss is the sum of loss_1 and loss_2. How can I log the two parts separately, along with the default (total) loss, at each logging step?
For example, forward returns a dictionary of loss, loss_1 and loss_2.
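One possible approach, a sketch rather than an official answer: subclass Trainer and override compute_loss. It assumes forward returns a dict with keys "loss", "loss_1", and "loss_2" (as in the question above), and it logs the components directly to wandb on every forward pass, so the frequency differs from the Trainer's logging_steps:

import wandb
from transformers import Trainer

class MultiLossTrainer(Trainer):
    # override compute_loss to log the individual loss components
    def compute_loss(self, model, inputs, return_outputs=False):
        outputs = model(**inputs)
        wandb.log({
            "train/loss_1": outputs["loss_1"].detach().item(),
            "train/loss_2": outputs["loss_2"].detach().item(),
        })
        loss = outputs["loss"]  # total loss used for the backward pass
        return (loss, outputs) if return_outputs else loss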
Hello.
I don't know if anyone has tried to run this script to resume training from a checkpoint (checkpoint-script), but I think the argument passed to run.use_artifact() should be my_checkpoint_name instead of my_model_name, as suggested in the docs. (wandb-docs)
last_run_id = "xxxxxxxx"  # fetch the run_id from your wandb workspace

# resume the wandb run from the run_id
with wandb.init(
    project=os.environ["WANDB_PROJECT"],
    id=last_run_id,
    resume="must",
) as run:
    # Connect an Artifact to the run
    my_checkpoint_name = f"checkpoint-{last_run_id}:latest"
    my_checkpoint_artifact = run.use_artifact(my_checkpoint_name)  # should not be my_model_name
    ...