Logs of training and validation loss

Hi, I made this post to see if anyone knows how I can save the results of my training and validation loss in the logs.

I’m using this code:

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    warmup_steps=50,                 # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=20,
    evaluation_strategy="steps"
)

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset             # evaluation dataset
)

I thought that using logging_dir and logging_steps would achieve that, but all I see in those logs is this:

output_dir ^A"^X
^Toverwrite_output_dir ^B"^L
^Hdo_train ^B"^K
^Gdo_eval ^A"^N

do_predict ^B"^\
^Xevaluate_during_training ^B"^W
^Sevaluation_strategy ^A"^X
^Tprediction_loss_only ^B"^_
^[per_device_train_batch_size ^C"^^
^Zper_device_eval_batch_size ^C"^\
^Xper_gpu_train_batch_size ^A"^[
^Wper_gpu_eval_batch_size ^A"^_
^[gradient_accumulation_steps ^C"^[
^Weval_accumulation_steps ^A"^Q
^Mlearning_rate ^C"^P
^Lweight_decay ^C"^N
And it goes on like that.
Any ideas would be welcome. :slight_smile:

My system installation:
- transformers version: 3.4.0
- Platform: Linux-3.10.0-1127.13.1.el7.x86_64-x86_64-with-centos-7.8.2003-Core
- Python version: 3.6.8
- PyTorch version (GPU?): 1.6.0 (False)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?: No
- Using distributed or parallel set-up in script?: No

2 Likes

Hi!

I was also recently trying to save my loss values at each logging_steps into a .txt file.

There might be a parameter I am unaware of, but in the meantime I pulled the latest version of the transformers library from git and slightly modified trainer.py, adding the following lines to def log(self, logs: Dict[str, float]) -> None: to save my logs into a .txt file:

# TODO PRINT ADDED BY XXX
with open('lossoutput.txt', 'a') as logSave:
    logSave.write(str(output) + '\n')

Happy to hear if there is a less ‘cowboy’ way to do this, one that would not require modifying trainer.py :sweat_smile:

You could also subclass Trainer and override the log method to do this (which is less cowboy-y :wink: ). @lysandre is the logger master and might know a more clever way to directly redirect the logs from our logger.
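
For illustration, a minimal sketch of that subclassing idea might look like the following; the class name LoggingTrainer and the file name losslog.txt are only placeholders, and it assumes the log(self, logs) signature of Trainer:

import json
from transformers import Trainer

class LoggingTrainer(Trainer):
    def log(self, logs):
        # Keep the default behaviour (console, TensorBoard, state.log_history, ...)
        super().log(logs)
        # Additionally append every log entry (training loss, eval loss, lr, ...) to a file
        with open("losslog.txt", "a") as f:
            f.write(json.dumps({**logs, "step": self.state.global_step}) + "\n")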

3 Likes

The things I’m thinking of are way more cowboy than what you’re doing @aberquand! I think @sgugger’s solution is the cleanest.

You could redirect all the logs to a text file and then filter them out, but your approach here sounds better.

I know this is late as hell, but I will leave this here for future reference in case anyone comes across this post. I think an even less cowboy way would be to use a callback:

import transformers

class LogCallback(transformers.TrainerCallback):
    def on_evaluate(self, args, state, control, **kwargs):
        # calculate/record the loss here; the latest eval metrics are in kwargs["metrics"]
        pass

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    compute_metrics=compute_metrics,
    callbacks=[LogCallback],
)
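
As a sketch of that idea (the file name loss_log.jsonl is just an example), a callback along these lines could use on_log, which receives the freshly logged dict directly:

import json
from transformers import TrainerCallback

class SaveLossCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        # `logs` holds whatever the Trainer just logged, e.g. {"loss": ...} at a
        # logging step or {"eval_loss": ...} after an evaluation.
        if logs is not None:
            with open("loss_log.jsonl", "a") as f:
                f.write(json.dumps({"step": state.global_step, **logs}) + "\n")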

Another even less cowboy way (without implementing anything) is to keep using those logging_steps args etc. and simply access the logs after training is complete:

trainer.state.log_history

You will have the metrics and losses from all logging steps over training. Hope this helps someone in the future.
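
For example, a rough sketch of pulling the training and validation losses out of trainer.state.log_history and plotting them (assuming matplotlib and the default "loss" / "eval_loss" keys):

import matplotlib.pyplot as plt

history = trainer.state.log_history  # list of dicts, one per logging/evaluation event

train_steps = [e["step"] for e in history if "loss" in e]
train_loss = [e["loss"] for e in history if "loss" in e]
eval_steps = [e["step"] for e in history if "eval_loss" in e]
eval_loss = [e["eval_loss"] for e in history if "eval_loss" in e]

plt.plot(train_steps, train_loss, label="training loss")
plt.plot(eval_steps, eval_loss, label="validation loss")
plt.xlabel("step")
plt.ylabel("loss")
plt.legend()
plt.show()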

22 Likes

This is the way!

I have issues with the code that you provided. Can you please help me? I want to store the log history of my training run so that I can plot the loss curves. How can I modify my code to fix this? I am not using validation data: my entire dataset is used for training and I did not split it because of its small size.

Here is the code I tried:
import matplotlib.pyplot as plt
import numpy as np
from transformers import TrainerCallback

class LogCallback(TrainerCallback):
    def __init__(self, state):
        self.state = state

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % 10 == 0:
            for name, value in state.log_history[-1].items():
                print(f"{name} at step {state.global_step}: {value}")

training_args = TrainingArguments(
    output_dir="gpt_model",
    overwrite_output_dir=True,
    learning_rate=7e-5,
    weight_decay=0.01,
    num_train_epochs=350,
    logging_steps=50,
    save_total_limit=2,
    per_device_train_batch_size=3,
    #gradient_accumulation_steps=4,
    save_steps=10_000,
    evaluation_strategy='no',
    #load_best_model_at_end=True
)

trainer = Trainer(
    model=mpm_model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=mpm_dataset['train'],
    #compute_metrics=compute_metrics,
    callbacks=[LogCallback],
)

train_output = trainer.train()
log_callback = LogCallback(state)

Error:

TypeError                                 Traceback (most recent call last)
Cell In[31], line 32
     13 print(f"{name} at step {state.global_step}: {value}")
     17 training_args = TrainingArguments(
     18     output_dir="gpt_model",
     19     overwrite_output_dir=True,
    (...)
     29     #load_best_model_at_end=True
     30 )
---> 32 trainer = Trainer(
     33     model=mpm_model,
     34     args=training_args,
     35     data_collator=data_collator,
     36     train_dataset=mpm_dataset['train'],
     37     #compute_metrics=compute_metrics,
     38     callbacks=[LogCallback],
     39 )
     41 train_output = trainer.train()
     42 log_callback = LogCallback(state)

File /scratch/kkonatha/kk/lib/python3.9/site-packages/transformers/trainer.py:519, in Trainer.__init__(self, model, args, data_collator, train_dataset, eval_dataset, tokenizer, model_init, compute_metrics, callbacks, optimizers, preprocess_logits_for_metrics)
    517 default_callbacks = DEFAULT_CALLBACKS + get_reporting_integration_callbacks(self.args.report_to)
    518 callbacks = default_callbacks if callbacks is None else default_callbacks + callbacks
--> 519 self.callback_handler = CallbackHandler(
    520     callbacks, self.model, self.tokenizer, self.optimizer, self.lr_scheduler
    521 )
    522 self.add_callback(PrinterCallback if self.args.disable_tqdm else DEFAULT_PROGRESS_CALLBACK)
    524 # Will be set to True by self._setup_loggers() on first call to self.log().

File /scratch/kkonatha/kk/lib/python3.9/site-packages/transformers/trainer_callback.py:296, in CallbackHandler.__init__(self, callbacks, model, tokenizer, optimizer, lr_scheduler)
    294 self.callbacks = []
    295 for cb in callbacks:
--> 296     self.add_callback(cb)
    297 self.model = model
    298 self.tokenizer = tokenizer

File /scratch/kkonatha/kk/lib/python3.9/site-packages/transformers/trainer_callback.py:313, in CallbackHandler.add_callback(self, callback)
    312 def add_callback(self, callback):
--> 313     cb = callback() if isinstance(callback, type) else callback
    314     cb_class = callback if isinstance(callback, type) else callback.__class__
    315     if cb_class in [c.__class__ for c in self.callbacks]:

TypeError: __init__() missing 1 required positional argument: 'state'
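
The traceback suggests that because callbacks=[LogCallback] passes the class itself, the Trainer tries to instantiate it with no arguments, so a custom __init__ that requires state fails. A minimal sketch of the same callback without that __init__ (the Trainer passes state to every callback hook anyway) might be:

from transformers import TrainerCallback

class LogCallback(TrainerCallback):
    # No custom __init__ needed: `state` is passed to every hook by the Trainer.
    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step % 10 == 0 and state.log_history:
            for name, value in state.log_history[-1].items():
                print(f"{name} at step {state.global_step}: {value}")

Alternatively, passing an already-constructed instance (callbacks=[LogCallback()]) avoids the automatic instantiation.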

Hi @perch, @sgugger, I hope you are well. I get these logs from the trainer in my folder. How can I use them and open them? I want to see the training loss and validation loss. Many thanks.

Hi @workpiece, I hope you are well. Sorry, did you manage to save the logs? How did you visualize them? Many thanks for your help. These are my arguments:

training_args = TrainingArguments(
    output_dir=Results_Path,
    learning_rate=5e-5,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    seed=42,
    load_best_model_at_end=True,
    logging_steps=5000,
    per_device_train_batch_size=2,
    save_total_limit=1,
    per_device_eval_batch_size=2,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir=Results_Path,
)

Trainer(model=model, args=training_args, train_dataset=train_dataset,
        eval_dataset=val_dataset,
        data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                    'attention_mask': torch.stack([f[1] for f in data]),
                                    'labels': torch.stack([f[0] for f in data])}).train()
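
As a rough sketch of how to get at those losses afterwards: keep the Trainer in a variable instead of calling .train() on it directly, then read trainer.state.log_history once training finishes (with logging_strategy="epoch" there should be one entry per epoch). The files written to logging_dir are TensorBoard event files, so they are meant to be opened with TensorBoard rather than a text editor.

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                'attention_mask': torch.stack([f[1] for f in data]),
                                'labels': torch.stack([f[0] for f in data])},
)
trainer.train()

# Each entry is a dict such as {'loss': ..., 'epoch': ...} or {'eval_loss': ..., 'epoch': ...}
for entry in trainer.state.log_history:
    print(entry)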