Is it possible to use transformer callbacks to track the amount of resources (e.g. GPU memory, CPU usage) used by each process during training? If so, how can this be implemented? Specifically, I’m interested in learning how to call a logger for each process individually.
You could output GPU usage like this.
I set it up to print after each training step.
I'm not sure about hooking into the Trainer's own logger, since I don't know its log format.
import torch
import psutil
from transformers import Trainer, TrainerCallback

class MyCallback(TrainerCallback):
    def __init__(self):
        self.num_devices = torch.cuda.device_count() if torch.cuda.is_available() else 0

    def show_usage(self):
        # Per-GPU memory as seen by the current process
        for i in range(self.num_devices):
            device = torch.device(f"cuda:{i}")
            print(f"GPU usage: {device}: Allocated memory: {torch.cuda.memory_allocated(device)} bytes / Max allocated memory: {torch.cuda.max_memory_allocated(device)} bytes")
        # Process-wide CPU and RAM usage via psutil
        print(f"CPU usage: {psutil.cpu_percent(interval=1)}% / CPU usage per core: {psutil.cpu_percent(percpu=True)}%")
        print(f"RAM usage: {psutil.virtual_memory().percent}%")

    def on_step_end(self, args, state, control, **kwargs):
        self.show_usage()

trainer = Trainer(
    model,
    args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[MyCallback],  # We can either pass the callback class this way or an instance of it (MyCallback())
)
Although it doesn't use callbacks, this library may also be useful for monitoring usage.