Using TensorBoard SummaryWriter with the Hugging Face Trainer API

I am fine-tuning a Hugging Face transformer model (PyTorch version), using the HF Seq2SeqTrainingArguments & Seq2SeqTrainer, and I want to display the training and validation losses in TensorBoard (in the same chart).

As far as I understand, in order to plot the two losses together I need to use the SummaryWriter. The HF Callbacks documentation describes a TensorBoardCallback that can receive a tb_writer argument.

However, I cannot figure out what the right way to use it is, or if it is even supposed to be used with the Trainer API.
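Just to illustrate what I mean by "the same chart": a SummaryWriter can log several scalars under one main tag with add_scalars. This is a standalone sketch, unrelated to my training code below; the loss values and log directory are made up.

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='runs/example')   # made-up log directory
# Logging two scalars under the same main tag puts them on the same chart.
for step, (train_loss, val_loss) in enumerate([(0.9, 1.0), (0.7, 0.8), (0.5, 0.6)]):
    writer.add_scalars('loss', {'train': train_loss, 'val': val_loss}, step)
writer.close()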

My code looks something like this:

args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    evaluation_strategy='epoch',
    learning_rate=1e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,
    report_to='tensorboard',
    push_to_hub=False,
)

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_train_data,
    eval_dataset=tokenized_val_data,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

I would assume I should include the TensorBoard callback in the trainer, e.g.,

callbacks = [TensorBoardCallback(tb_writer=tb_writer)]

but I cannot find a comprehensive example of how to use it or what to import.
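For reference, this is a minimal sketch of how I imagine the pieces could fit together; the import path and log directory are my guesses, not something I found documented as the intended usage.

from torch.utils.tensorboard import SummaryWriter
from transformers.integrations import TensorBoardCallback

tb_writer = SummaryWriter(log_dir=output_dir)   # log directory is just a guess here

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_train_data,
    eval_dataset=tokenized_val_data,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[TensorBoardCallback(tb_writer=tb_writer)],  # hand the writer to the callback
)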

I also found this feature request on GitHub, but no usage example, so I am confused…

Any insight will be appreciated.


Hi @Anna-Kay, I hope you are well. Sorry to bother you, but did you find a solution to your problem? I have the same question.

Hello @SUNM!

The following solution was helpful for me:


@Anna-Kay, many thanks for your attention. Sorry, during training I can see the saved checkpoints, but when training is finished no checkpoint is left for testing; all checkpoints disappear from the folder. Would you please tell me how I can save the best model? My code is as follows:


training_args = TrainingArguments(
    output_dir=Results_Path,
    learning_rate=5e-5,
    num_train_epochs=10,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    seed=42,
    load_best_model_at_end=True,
    report_to="tensorboard",
    per_device_train_batch_size=2,
    save_total_limit=1,
    per_device_eval_batch_size=2,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir=Results_Path,
)


Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=lambda data: {
        'input_ids': torch.stack([f[0] for f in data]),
        'attention_mask': torch.stack([f[1] for f in data]),
        'labels': torch.stack([f[0] for f in data]),
    },
).train()

Here is adjusted code that shows how to save the best model. What I changed:

  • Removed save_total_limit=1 (this limits your saved checkpoints to 1, which makes "selection of the best model" rather pointless, as there is at most one checkpoint to choose from).
  • Assigned the Trainer to a local variable, so we can use the object reference after training is done.
  • Changed save_strategy to "steps", to enable a faster first save for demonstration.
  • Added save_steps=200, so a checkpoint is saved every 200 steps.
  • Changed evaluation_strategy to "steps" and added eval_steps=200 to keep it in sync with save_steps. To select the best checkpoint after training ends, it makes sense that the evaluation (and not just training) metrics are as up to date as possible when we save the model.
  • Changed logging_strategy to "steps" as well, keeping everything in the same range.
  • Added calls to save the model and processor state after training.
training_args = TrainingArguments(
    output_dir=Results_Path,
    learning_rate=5e-5,
    num_train_epochs=10,
    evaluation_strategy="steps",
    logging_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    seed=42,
    load_best_model_at_end=True,
    report_to="tensorboard",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir=Results_Path,
)


trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=lambda data: {
        'input_ids': torch.stack([f[0] for f in data]),
        'attention_mask': torch.stack([f[1] for f in data]),
        'labels': torch.stack([f[0] for f in data]),
    },
)

trainer.train()
# With load_best_model_at_end=True, once training finishes the Trainer goes through all
# checkpoints recorded for this run (i.e. those in the output directory that are valid for
# this training and not corrupted), picks the best one, and loads it back into the model.
# That best checkpoint is therefore what the following two statements save.

model.save_pretrained(training_args.output_dir)
processor.save_pretrained(training_args.output_dir)

# If no errors occurred, your output_dir now has all the files required to load the model back, using that path as the source.
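For completeness, loading it back could look roughly like this. This is just a sketch: I am using AutoModelForCausalLM and AutoTokenizer as stand-ins for whatever model and processor classes you actually trained with.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the best checkpoint that was saved to output_dir above.
model = AutoModelForCausalLM.from_pretrained(training_args.output_dir)
tokenizer = AutoTokenizer.from_pretrained(training_args.output_dir)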
