How to read the logs created by the Hugging Face Trainer?

args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_dir='./logs',
    label_names=["labels"],
)

How to read the training and validation losses?

I am getting log files like this: /content/logs/events.out.tfevents.1677075882.2041f82e97b0.219.0

The logging_dir is where the TensorBoard event files are stored. Since you're not specifying logging_strategy/logging_steps, the Trainer logs every 500 steps by default. You can visualize the data in a web browser with the following command:
tensorboard --logdir /content/logs

This will output some text in the terminal, which should contain a localhost address (something like http://localhost:6006/). Copy and paste that into a web browser (or Ctrl+click it) and you should see a bunch of TensorBoard plots.
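If you prefer to read those event files programmatically instead of through the TensorBoard UI, here is a minimal sketch using TensorBoard's EventAccumulator. The scalar tag names ("train/loss", "eval/loss") are assumptions about what the Trainer's TensorBoard callback writes, so check the output of ea.Tags() on your own files first:

# Minimal sketch: read scalars back out of the events.out.tfevents.* files.
# Assumes the tensorboard package is installed; the tag names used below
# are assumptions and may differ across transformers versions.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

ea = EventAccumulator("/content/logs")   # directory containing the event files
ea.Reload()                              # actually loads the events from disk

print(ea.Tags()["scalars"])              # lists the available scalar tags

for event in ea.Scalars("train/loss"):   # assumed tag name, see the printout above
    print(event.step, event.value)

Alternatively, after trainer.train() you can inspect trainer.state.log_history, a plain list of dicts that contains the logged loss and eval_loss values.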

Hi @mapama247, sorry, should I write this command in the terminal or in the Python environment?

Hi @mapama247, I would appreciate it if you could let me know how I can view my TensorBoard logs; my system is Linux. My training code is as follows, with Results_Path='/home/nlpproject/':

training_args = TrainingArguments(
    output_dir=Results_Path, logging_dir=Results_Path, seed=42,
    learning_rate=5e-5, num_train_epochs=15, warmup_steps=100, weight_decay=0.01,
    per_device_train_batch_size=2, per_device_eval_batch_size=2,
    evaluation_strategy="epoch", logging_strategy="epoch", save_strategy="epoch",
    logging_steps=5000, save_total_limit=1, load_best_model_at_end=True,
)

Trainer(
    model=model, args=training_args,
    train_dataset=train_dataset, eval_dataset=val_dataset,
    data_collator=lambda data: {
        'input_ids': torch.stack([f[0] for f in data]),
        'attention_mask': torch.stack([f[1] for f in data]),
        'labels': torch.stack([f[0] for f in data]),
    },
).train()

Hi @illstart, I hope you are fine. Sorry, I want to visualize the training and validation losses logged by the Trainer. Would you please tell me how you did that? My training code is the same as above.


In the terminal, as long as you have tensorboard installed.

Do you see some files (with weird names) in /home/nlpproject/? In that case, open a terminal and do:

tensorboard --logdir /home/nlpproject/
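If you would rather stay inside a notebook (e.g. Jupyter or Colab) than open a terminal, TensorBoard also ships a notebook extension. A minimal sketch, assuming the tensorboard package is installed in the notebook kernel:

# Minimal sketch: launch TensorBoard inside a notebook instead of the terminal.
%load_ext tensorboard
%tensorboard --logdir /home/nlpproject/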

@mapama247, many many thanks for your reply. Sorry, during training I can see the saved checkpoints, but when training is finished no checkpoint is left for testing; all the checkpoints disappear from the folder. Would you please tell me how I can save the best model? My code is as follows:


training_args = TrainingArguments(
    output_dir=Results_Path, logging_dir=Results_Path, seed=42,
    learning_rate=5e-5, num_train_epochs=10, warmup_steps=100, weight_decay=0.01,
    per_device_train_batch_size=2, per_device_eval_batch_size=2,
    evaluation_strategy="epoch", logging_strategy="epoch", save_strategy="epoch",
    save_total_limit=1, load_best_model_at_end=True, report_to="tensorboard",
)

Trainer(
    model=model, args=training_args, tokenizer=tokenizer,
    train_dataset=train_dataset, eval_dataset=val_dataset,
    data_collator=lambda data: {
        'input_ids': torch.stack([f[0] for f in data]),
        'attention_mask': torch.stack([f[1] for f in data]),
        'labels': torch.stack([f[0] for f in data]),
    },
).train()

Ah okay… but this is a different problem that has nothing to do with TensorBoard :man_shrugging: The reason the checkpoints disappear is the argument save_total_limit=1, which limits the number of saved checkpoints to 1, so older checkpoints are deleted as new ones are written. Just remove that argument or increase the limit.
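On the related question of how to save the best model: since you already set load_best_model_at_end=True, the best checkpoint (by eval loss, by default) is reloaded into the Trainer when training finishes, so you can write it out explicitly afterwards. A minimal sketch, reusing the names from your post; the output path is just an example:

# Minimal sketch: persist the best model once training is done.
# Assumes load_best_model_at_end=True in training_args, so trainer.model
# holds the best checkpoint after train() returns. The path is an example.
trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    # plus the data_collator from your post
)
trainer.train()
trainer.save_model("/home/nlpproject/best_model")  # writes config + weights (and tokenizer, if one was passed)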

Hi @mapama247, sorry, do you know how I can save the model after each epoch, regardless of whether it is the best model or not? If I change the strategies from steps to epoch, it doesn't keep any checkpoints at the end.
This is my code:

training_args = TrainingArguments(
    output_dir=Results_Path, logging_dir=Results_Path, seed=42,
    learning_rate=5e-5, num_train_epochs=15, warmup_steps=100, weight_decay=0.01,
    per_device_train_batch_size=2, per_device_eval_batch_size=2,
    evaluation_strategy="steps", eval_steps=500,
    logging_strategy="steps", logging_steps=500,
    save_strategy="steps", save_steps=500, save_total_limit=2,
    load_best_model_at_end=True, report_to="tensorboard",
)

Remove save_total_limit=2, set save_strategy, evaluation_strategy and logging_strategy to "epoch", and remove save_steps=500, eval_steps=500 and logging_steps=500. This way you should get 15 folders in your output_dir, each with a different checkpoint (one per epoch), and with load_best_model_at_end=True the best one is reloaded at the end of training.
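Putting that advice together, a minimal sketch of the arguments (the hyperparameter values are just the ones from the earlier posts):

# Minimal sketch of the per-epoch setup described above: one checkpoint
# folder per epoch (no save_total_limit, no *_steps arguments), with the
# best model reloaded at the end of training.
training_args = TrainingArguments(
    output_dir=Results_Path,
    logging_dir=Results_Path,
    learning_rate=5e-5,
    num_train_epochs=15,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    warmup_steps=100,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    report_to="tensorboard",
    seed=42,
)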


@mapama247 many thanks for your reply. Sorry, have you used multiple GPUs with the Trainer API? I tried it, but the results are very strange compared with using 1 GPU. I didn't change anything in my single-GPU code and just let the Trainer use all GPUs, but the results are strange. Can you please help me with that?

I ran into the same issue with multi-gpu training. Were you able to resolve this?