GPU utilization almost always 0 during training

Hi there, I am fine-tuning a Donut Cord-v2 model on my invoice data, which is around 360 GB when preprocessed and saved to disk as a dataset. I am following this notebook almost exactly, except that I use 6 training epochs instead of 3.

I am training on a single Nvidia H100 SXM GPU / Intel Xeon® Gold 6448Y / 128 GB RAM.

Whenever I start training and inspect CPU and GPU utilization using htop and nvidia-smi, I see that CPU utilization sits at 10-12% (used by python), GPU memory is almost 90% full the whole time, but GPU utilization is almost always 0. If I keep refreshing the output of nvidia-smi, once every 10-12 seconds the utilization jumps to 100% and then immediately drops back to 0. I can't help but feel there is a bottleneck between my CPU and GPU: the CPU constantly processes data and sends it to the GPU, the GPU finishes it very quickly and then idles, waiting for the next batch from the CPU. I load the already pre-processed dataset from disk like so:

from datasets import load_from_disk
processed_dataset = load_from_disk(r"/dataset/dataset_final")
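
One way to confirm whether data loading is the bottleneck is to time how long a single preprocessed row takes to come off disk, independently of the model. A minimal sketch, assuming the dataset has the pixel_values column produced by the notebook's preprocessing step:

import time
import numpy as np

train_ds = processed_dataset["train"]

# load_from_disk memory-maps the Arrow files, so every row access is a disk
# read unless the pages happen to be in the OS page cache already.
start = time.perf_counter()
sample = train_ds[0]
elapsed = time.perf_counter() - start

pixels = np.asarray(sample["pixel_values"], dtype=np.float32)  # column name assumed
print(f"one row: {elapsed * 1000:.1f} ms, pixel shape {pixels.shape}, ~{pixels.nbytes / 1e6:.1f} MB at float32")
# If a single row already costs tens of milliseconds, a batch of 8 means a
# noticeable chunk of pure I/O and deserialization per step, which would match
# the pattern of the GPU idling between short bursts of work.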

My processor config is as follows:


from transformers import DonutProcessor

new_special_tokens = [] # new tokens which will be added to the tokenizer
task_start_token = "<s>"  # start of task token
eos_token = "</s>" # eos token of tokenizer

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

# add new special tokens to tokenizer
processor.tokenizer.add_special_tokens({"additional_special_tokens": new_special_tokens + [task_start_token] + [eos_token]})

# we update some settings which differ from pretraining; namely the size of the images + no rotation required
processor.feature_extractor.size = [1200,1553] # should be (width, height)
processor.feature_extractor.do_align_long_axis = False
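
As a sanity check on those settings, running the processor on a dummy page shows the tensor shape and size every training sample will have. A rough sketch; note that newer transformers releases expect size as a dict like {"height": 1553, "width": 1200} rather than a list:

from PIL import Image
import numpy as np

# A blank white "page" just to inspect the output shape; with width 1200 and
# height 1553 the pixel tensor should come out as (1, 3, 1553, 1200).
dummy = Image.fromarray(np.full((2000, 1500, 3), 255, dtype=np.uint8))
pixel_values = processor(dummy, return_tensors="pt").pixel_values
print(pixel_values.shape, pixel_values.dtype)
# At float32 this is 3 * 1553 * 1200 * 4 bytes, roughly 22 MB per image, which
# is why the preprocessed dataset grows to hundreds of GB on disk.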

My model config is:



import torch
from transformers import VisionEncoderDecoderModel, VisionEncoderDecoderConfig

#print(torch.cuda.is_available())

# Load model from huggingface.co
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

# Resize embedding layer to match vocabulary size
new_emb = model.decoder.resize_token_embeddings(len(processor.tokenizer))
print(f"New embedding size: {new_emb}")
# Adjust our image size and output sequence lengths
model.config.encoder.image_size = processor.feature_extractor.size[::-1] # (height, width)
model.config.decoder.max_length = len(max(processed_dataset["train"]["labels"], key=len))

# Add task token for decoder to start
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.decoder_start_token_id = processor.tokenizer.convert_tokens_to_ids(['<s>'])[0]
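
To separate GPU compute time from data-loading time, a single forward/backward pass on a synthetic batch of the configured size can be timed in isolation. A rough sketch, assuming the same batch size of 8 as training; the label length of 128 is an arbitrary placeholder:

import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Synthetic batch matching the configured image size; labels are random token ids.
height, width = model.config.encoder.image_size
pixel_values = torch.randn(8, 3, height, width, device=device)
labels = torch.randint(0, len(processor.tokenizer), (8, 128), device=device)

if device == "cuda":
    torch.cuda.synchronize()  # make sure the timing covers the GPU work itself
start = time.perf_counter()
with torch.autocast(device, dtype=torch.float16, enabled=(device == "cuda")):
    outputs = model(pixel_values=pixel_values, labels=labels)  # mirrors fp16=True training
outputs.loss.backward()
if device == "cuda":
    torch.cuda.synchronize()
print(f"one forward+backward step on a batch of 8: {time.perf_counter() - start:.2f} s")

model.zero_grad()                   # discard the gradients from this test step
del pixel_values, labels, outputs   # free the synthetic batch before real training

If this step is much faster than fetching a batch from disk, the GPU is being starved by the input pipeline rather than slowed down by the model itself.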

And my training code is:


import gc
gc.collect()

torch.cuda.empty_cache()


from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

import logging
logging.basicConfig(level=logging.INFO)

# Arguments for training
training_args = Seq2SeqTrainingArguments(
    output_dir=r"/trained",  # Specify a local directory to save the model
    num_train_epochs=6,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    weight_decay=0.01,
    fp16=True,
    logging_steps=50,
    save_total_limit=2,
    evaluation_strategy="no",
    save_strategy="epoch",
    predict_with_generate=True,
    report_to="none",
    # Disable push to hub
    push_to_hub=False
   
)

# Create Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=processed_dataset["train"],
)



# Start training
trainer.train()

# Save the trained model
model.save_pretrained(r"/trained/final/model")

# Save the processor (tokenizer and feature extractor)
processor.save_pretrained(r"/trained/final/processor")

The estimated time to complete training with 6 epochs on the 360 GB dataset is 54 hours. When I run the exact same code on my PC, which has an Intel i9 11900KF / RTX 3050, I see GPU utilization constantly at 100%. Is there a bottleneck in my code? Why does the CPU keep doing so much work on an already preprocessed dataset? CUDA 12.6.

Does it make sense to set the dataloader_num_workers parameter (a Seq2SeqTrainingArguments field) to a value greater than 0, since my RAM and CPU core count allow it (and since CPU utilization is at 10-12% max)?
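
For reference, this is roughly what I have in mind; both are standard Seq2SeqTrainingArguments fields, and the worker count of 8 is just a starting guess to tune against the core count:

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir=r"/trained",
    num_train_epochs=6,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    weight_decay=0.01,
    fp16=True,
    logging_steps=50,
    save_total_limit=2,
    evaluation_strategy="no",
    save_strategy="epoch",
    predict_with_generate=True,
    report_to="none",
    push_to_hub=False,
    dataloader_num_workers=8,    # worker processes that prefetch batches in parallel with GPU compute
    dataloader_pin_memory=True,  # pinned host memory for faster CPU-to-GPU copies (True by default)
)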


This is from an amateur in AI, so the answer may fall a bit short of an expert answer.

  1. Your server and your local PC have different configurations, so the utilization figures will never match.

  2. I have an AI model that I run on an Nvidia GPU; it also shows very low GPU utilization and about 10% CPU utilization while training (same as yours).


I am also new to the ML field, but having worked 12+ years in software engineering, I can tell you that 10% CPU utilization and a GPU that barely does anything except a spike every 10 seconds means there is a bottleneck or an inefficiency somewhere in how the data is processed. At the very least, the CPU should be using every core available to it to process data, not just a single Python process. On the data-loading side, that can be investigated by setting the dataloader_num_workers parameter to a non-zero value. I would also rather have all of my 120 GB of RAM filled with pre-processed data to avoid constant reading from disk (see the sketch below). I know the Torch and, later, Hugging Face engineers will have thought about all of this; there is just a lot of material to read and analyze before knowing for sure which training tactics are best in terms of performance and efficiency.
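
On the RAM idea, datasets does expose a keep_in_memory flag on load_from_disk; with 360 GB of data and ~120 GB of RAM the full dataset will not fit, so this is only a sketch of the idea (otherwise the OS page cache already keeps recently read chunks in memory on its own):

from datasets import load_from_disk

# keep_in_memory=True copies the Arrow tables into RAM instead of memory-mapping
# them from disk. With a 360 GB dataset and ~120 GB of RAM the full copy will not
# fit, so this only helps for a subset or a smaller preprocessed dataset.
processed_dataset = load_from_disk(r"/dataset/dataset_final", keep_in_memory=True)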
