Hi there, I am fine-tuning a Donut Cord-v2 model on my invoice data, which is around 360 GB when preprocessed and saved to disk as a dataset. I am following this notebook almost exactly, except that I train for 6 epochs instead of 3.
I am training on a single NVIDIA H100 SXM GPU / Intel Xeon® Gold 6448Y / 128 GB RAM.
Whenever I start training and inspect CPU and GPU utilization with htop and nvidia-smi, I see CPU utilization at 10-12% (all from the Python process) and GPU memory almost 90% full the whole time, but GPU utilization is almost always 0. If I keep refreshing the nvidia-smi output, roughly once every 10-12 seconds the utilization jumps to 100% and then immediately drops back to 0. I can't help but feel there is a bottleneck between my CPU and GPU: the CPU constantly prepares data and sends it to the GPU, and the GPU processes it very quickly and then just idles, waiting for the next batch from the CPU.

I load the already pre-processed dataset from disk like so (a quick read-speed check on it follows the snippet):
from datasets import load_from_disk
processed_dataset = load_from_disk(r"/dataset/dataset_final")
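To test my bottleneck theory, here is a minimal read-speed check I could run on the loaded dataset. It is only a sketch and assumes the split is called "train" and the columns are "pixel_values" and "labels", as produced by the notebook's preprocessing:

import time
train_ds = processed_dataset["train"]
print(train_ds.features)  # shows whether pixel_values are stored as nested Python lists
print(train_ds.format)    # shows whether a torch/numpy format is set on the dataset
start = time.perf_counter()
for i in range(32):
    _ = train_ds[i]  # pull individual preprocessed samples from the Arrow files
print(f"{32 / (time.perf_counter() - start):.2f} samples/s read on the CPU")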
My processor config is as follows:
from transformers import DonutProcessor
new_special_tokens = [] # new tokens which will be added to the tokenizer
task_start_token = "<s>" # start of task token
eos_token = "</s>" # eos token of tokenizer
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
# add new special tokens to tokenizer
processor.tokenizer.add_special_tokens({"additional_special_tokens": new_special_tokens + [task_start_token] + [eos_token]})
# we update some settings which differ from pretraining; namely the size of the images + no rotation required
processor.feature_extractor.size = [1200,1553] # should be (width, height)
processor.feature_extractor.do_align_long_axis = False
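As a quick sanity check of the image size setting, I could run the processor on a single page. This is just a sketch; the file path below is a made-up example, not a real file from my dataset:

from PIL import Image
sample = Image.open(r"/dataset/sample_invoice.png").convert("RGB")  # hypothetical example path
pixel_values = processor(sample, return_tensors="pt").pixel_values
print(pixel_values.shape)  # expecting (1, 3, H, W) matching the size set above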
My model config is:
import torch
from transformers import VisionEncoderDecoderModel, VisionEncoderDecoderConfig
#print(torch.cuda.is_available())
# Load model from huggingface.co
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
# Resize embedding layer to match vocabulary size
new_emb = model.decoder.resize_token_embeddings(len(processor.tokenizer))
print(f"New embedding size: {new_emb}")
# Adjust our image size and output sequence lengths
model.config.encoder.image_size = processor.feature_extractor.size[::-1] # (height, width)
model.config.decoder.max_length = len(max(processed_dataset["train"]["labels"], key=len))
# Add task token for decoder to start
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.decoder_start_token_id = processor.tokenizer.convert_tokens_to_ids(['<s>'])[0]
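And a small sanity check of the values derived above (sketch):

print(model.config.encoder.image_size)    # should be [1553, 1200], i.e. (height, width)
print(model.config.decoder.max_length)    # length of the longest label sequence in the train split
print(processor.tokenizer.convert_ids_to_tokens([model.config.decoder_start_token_id]))  # should print ['<s>']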
And my training code is:
import gc
gc.collect()
torch.cuda.empty_cache()
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
import logging
logging.basicConfig(level=logging.INFO)
# Arguments for training
training_args = Seq2SeqTrainingArguments(
    output_dir=r"/trained",  # Specify a local directory to save the model
    num_train_epochs=6,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    weight_decay=0.01,
    fp16=True,
    logging_steps=50,
    save_total_limit=2,
    evaluation_strategy="no",
    save_strategy="epoch",
    predict_with_generate=True,
    report_to="none",
    # Disable push to hub
    push_to_hub=False
)
# Create Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=processed_dataset["train"],
)
# Start training
trainer.train()
# Save the trained model
model.save_pretrained(r"/trained/final/model")
# Save the processor (tokenizer and feature extractor)
processor.save_pretrained(r"/trained/final/processor")
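Before launching the full run, I could also time the input pipeline on its own to see how many batches per second the CPU can produce. This is a sketch: it uses default_data_collator, which is, as far as I know, what the Trainer falls back to here since no data_collator or tokenizer is passed, and it assumes the processed dataset only contains the pixel_values and labels columns:

import time
from torch.utils.data import DataLoader
from transformers import default_data_collator

loader = DataLoader(
    processed_dataset["train"],
    batch_size=8,                     # same as per_device_train_batch_size
    collate_fn=default_data_collator,
    num_workers=0,                    # the Trainer default
)
n_batches = 10
it = iter(loader)
start = time.perf_counter()
for _ in range(n_batches):
    next(it)  # fetch and collate one batch on the CPU, without any GPU work
print(f"{n_batches / (time.perf_counter() - start):.2f} batches/s prepared by the CPU")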
The estimated time to complete training for 6 epochs on the 360 GB dataset is 54 hours. When I run the exact same code on my PC with an Intel i9 11900KF / RTX 3050, I see GPU utilization constantly at 100%. Is there a bottleneck in my code? Why does the CPU keep doing so much work on an already preprocessed dataset? CUDA version is 12.6.
Does it make sense to set the dataloader_num_workers parameter of Seq2SeqTrainingArguments to a value greater than 0, since my RAM and CPU core count allow it (and since CPU utilization is at 10-12% at most)?
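Concretely, I have in mind something like this (just a sketch; only the two dataloader_* arguments differ from my configuration above, and the worker count of 8 is a guess based on my core count, not something from the notebook):

training_args = Seq2SeqTrainingArguments(
    output_dir=r"/trained",
    num_train_epochs=6,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    weight_decay=0.01,
    fp16=True,
    logging_steps=50,
    save_total_limit=2,
    evaluation_strategy="no",
    save_strategy="epoch",
    predict_with_generate=True,
    report_to="none",
    push_to_hub=False,
    dataloader_num_workers=8,    # worker processes that prepare batches in parallel
    dataloader_pin_memory=True,  # the default; pins host memory for faster CPU-to-GPU copies
)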