Using TensorBoard SummaryWriter with the Hugging Face Trainer API

I am fine-tuning a Hugging Face transformer model (PyTorch version), using the HF Seq2SeqTrainingArguments & Seq2SeqTrainer, and I want to display the training and validation losses in TensorBoard (in the same chart).

As far as I understand, in order to plot the two losses together I need to use the SummaryWriter. The HF Callbacks documentation describes a TensorBoardCallback that can receive a tb_writer argument.

However, I cannot figure out what the right way to use it is, or if it is even supposed to be used with the Trainer API.
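Just to illustrate what I mean by "the same chart": a SummaryWriter can log several scalars under one main tag with add_scalars. This is a standalone sketch, unrelated to my training code below; the loss values and log directory are made up.

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='runs/example')   # made-up log directory
# Logging two scalars under the same main tag puts them on the same chart.
for step, (train_loss, val_loss) in enumerate([(0.9, 1.0), (0.7, 0.8), (0.5, 0.6)]):
    writer.add_scalars('loss', {'train': train_loss, 'val': val_loss}, step)
writer.close()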

My code looks something like this:

args = Seq2SeqTrainingArguments(
    output_dir=output_dir,
    evaluation_strategy='epoch',
    learning_rate=1e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,
    report_to='tensorboard',
    push_to_hub=False,
)

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_train_data,
    eval_dataset=tokenized_val_data,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

I would assume I should include the TensorBoard callback in the trainer, e.g.,

callbacks = [TensorBoardCallback(tb_writer=tb_writer)]

but I cannot find a comprehensive example of how to use it or what to import.
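For reference, this is a minimal sketch of how I imagine the pieces could fit together; the import path and log directory are my guesses, not something I found documented as the intended usage.

from torch.utils.tensorboard import SummaryWriter
from transformers.integrations import TensorBoardCallback

tb_writer = SummaryWriter(log_dir=output_dir)   # log directory is just a guess here

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_train_data,
    eval_dataset=tokenized_val_data,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[TensorBoardCallback(tb_writer=tb_writer)],  # hand the writer to the callback
)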

I also found this feature request on GitHub, but no usage example, so I am confused…

Any insight will be appreciated.


Hi @Anna-Kay, I hope you are well. Sorry to bother you, but did you find a solution to your problem? I have the same question.

Hello @SUNM!

The following solution was helpful for me:


@Anna-Kay, many thanks for your attention. Sorry, during training I can see the saved checkpoints, but when training is finished no checkpoint is left for testing; all checkpoints disappear from the folder. Would you please tell me how I can save the best model? My code is as follows:


training_args = TrainingArguments(
    output_dir=Results_Path,
    learning_rate=5e-5,
    num_train_epochs=10,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    seed=42,
    load_best_model_at_end=True,
    report_to="tensorboard",
    per_device_train_batch_size=2,
    save_total_limit=1,
    per_device_eval_batch_size=2,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir=Results_Path,
)


Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=lambda data: {
        'input_ids': torch.stack([f[0] for f in data]),
        'attention_mask': torch.stack([f[1] for f in data]),
        'labels': torch.stack([f[0] for f in data]),
    },
).train()

Here is adjusted code that shows how to save the best model. What I changed:

  • Removed save_total_limit=1 (this limits your saved checkpoints to 1, which makes "selection of the best model" rather pointless, as there is at most one checkpoint to choose from).
  • Assigned the Trainer to a local variable, so we can use the object reference after training is done.
  • Changed save_strategy to "steps", to enable a faster first save for demonstration.
  • Added save_steps=200, so a checkpoint is saved every 200 steps.
  • Changed evaluation_strategy to "steps" and added eval_steps=200 to keep it in sync with save_steps. To select the best checkpoint after training ends, it makes sense that the evaluation (and not just training) metrics are as up to date as possible when we save the model.
  • Changed logging_strategy to "steps" as well, keeping everything in the same range.
  • Added calls to save the model and processor state after training.
training_args = TrainingArguments(
    output_dir=Results_Path,
    learning_rate=5e-5,
    num_train_epochs=10,
    evaluation_strategy="steps",
    logging_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    seed=42,
    load_best_model_at_end=True,
    report_to="tensorboard",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir=Results_Path,
)


trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=lambda data: {
        'input_ids': torch.stack([f[0] for f in data]),
        'attention_mask': torch.stack([f[1] for f in data]),
        'labels': torch.stack([f[0] for f in data]),
    },
)

trainer.train()
# With load_best_model_at_end=True, once training finishes the Trainer goes through all
# checkpoints recorded for this run (i.e. those in the output directory that are valid for
# this training and not corrupted), picks the best one, and loads it back into the model.
# That best checkpoint is therefore what the following two statements save.

model.save_pretrained(training_args.output_dir)
processor.save_pretrained(training_args.output_dir)

# If no errors occurred, your output_dir now has all the files required to load the model back, using that path as the source.
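For completeness, loading it back could look roughly like this. This is just a sketch: I am using AutoModelForCausalLM and AutoTokenizer as stand-ins for whatever model and processor classes you actually trained with.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the best checkpoint that was saved to output_dir above.
model = AutoModelForCausalLM.from_pretrained(training_args.output_dir)
tokenizer = AutoTokenizer.from_pretrained(training_args.output_dir)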
