Is it possible to use transformer callbacks to track the amount of resources (e.g. GPU memory, CPU usage) used by each process during training? If so, how can this be implemented? Specifically, I’m interested in learning how to call a logger for each process individually.
You could output GPU usage like this.
I set it up to print after each training step.
I'm not sure about hooking into the Trainer's own logger, since I don't know its log format.
import torch
import psutil
from transformers import Trainer, TrainerCallback

class MyCallback(TrainerCallback):
    def __init__(self):
        self.num_devices = torch.cuda.device_count() if torch.cuda.is_available() else 0

    def show_usage(self):
        # Per-GPU memory as seen by the current process
        for i in range(self.num_devices):
            device = torch.device(f"cuda:{i}")
            print(f"GPU usage: {device}: Allocated memory: {torch.cuda.memory_allocated(device)} bytes / Max allocated memory: {torch.cuda.max_memory_allocated(device)} bytes")
        # Process-wide CPU and RAM usage via psutil
        print(f"CPU usage: {psutil.cpu_percent(interval=1)}% / CPU usage per core: {psutil.cpu_percent(percpu=True)}%")
        print(f"RAM usage: {psutil.virtual_memory().percent}%")

    def on_step_end(self, args, state, control, **kwargs):
        self.show_usage()

trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[MyCallback],  # We can either pass the callback class this way or an instance of it (MyCallback())
)
Although it doesn't use callbacks, this library may also be useful for monitoring usage.