SFTTrainer training very slow. Is this training speed expected?

Hello,

I am currently trying to do full fine-tuning of the ai-forever/mGPT model (1.3B parameters) on a single A100 GPU (40GB VRAM) in Google Colab. However, training is very slow: ~0.06 it/s.

I was wondering whether this is the expected training speed or whether there is some issue with my code. And if it is an issue, what would a possible fix be?

Here is my code:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# assuming a default fp32 load of ai-forever/mGPT (1.3B parameters)
model = AutoModelForCausalLM.from_pretrained("ai-forever/mGPT")
tokenizer = AutoTokenizer.from_pretrained("ai-forever/mGPT")

# Lithuanian ("lt") subset of C4
dataset = load_dataset("allenai/c4", "lt")

train_dataset = dataset["train"]
eval_dataset = dataset["validation"]

# keep 10k training examples and 1k eval examples
train_dataset = train_dataset.take(10000)
eval_dataset = eval_dataset.take(1000)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,

    args = TrainingArguments(
        gradient_accumulation_steps = 4,
        gradient_checkpointing = True,

        num_train_epochs = 3,
        learning_rate = 2e-4,
        per_device_train_batch_size = 4,
        per_device_eval_batch_size = 4,

        seed = 99,
        output_dir = "./checkpoints",

        save_strategy = "steps",
        eval_strategy = "steps",

        save_steps = 0.1,
        eval_steps = 0.1,
        logging_steps = 0.1,
        load_best_model_at_end = True
    ),
)

trainer_stats = trainer.train()

And the trainer output (screenshot not reproduced here):

It says it will take ~10 hours to process the 10k examples from the C4 dataset. Is this normal?
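
For what it's worth, that estimate is consistent with the settings above; a back-of-the-envelope check (assuming each progress-bar iteration is one optimizer step of per_device_train_batch_size * gradient_accumulation_steps = 16 examples):

# back-of-the-envelope check of the ~10 h estimate
examples_per_step = 4 * 4                       # per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = 10_000 // examples_per_step   # 625 optimizer steps per epoch
total_steps = 3 * steps_per_epoch               # 1875 steps for num_train_epochs = 3
print(total_steps / 0.06 / 3600)                # ~8.7 h at 0.06 it/s, plus periodic eval -> roughly 10 h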

1 Like

> (1.3B parameters) using a single A100 GPU (40GB VRAM)

I’m not a training expert myself, but I think this is way too slow for those specs and that model size…
Likely causes would be the GPU not being used properly or a batch size that is too small, but normally you don’t have to configure anything special for the GPU to be picked up…
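
A quick way to rule out the first case (a minimal sketch, assuming the model object from the first post):

import torch

# confirm the weights actually sit on the GPU before training
print(torch.cuda.is_available())          # should be True on the Colab A100
print(next(model.parameters()).device)    # should be cuda:0, not cpu
print(next(model.parameters()).dtype)     # torch.float32 here means no mixed precision is being used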

These are the relevant package versions (the GPU-usage screenshot is not reproduced here):

Package                            Version
---------------------------------- -------------------
accelerate                         0.34.2
bitsandbytes                       0.44.1
datasets                           3.1.0
peft                               0.13.2
torch                              2.5.0+cu121
trl                                0.12.0

The model does seem to be loaded onto the GPU, but training is still slow for some reason.

I tried to use keep_in_memory=True when loading the dataset, but it did not help.

I also tried pre-tokenizing the dataset and using Trainer instead of SFTTrainer, but the performance was similar.
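
Roughly, that attempt looks like the sketch below (illustrative rather than the exact code; it assumes the same 2048-token limit and reuses the TrainingArguments from the first post as training_args):

from transformers import Trainer, DataCollatorForLanguageModeling

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token   # the collator needs a pad token

def tokenize(batch):
    # same 2048-token context as in the SFTTrainer setup
    return tokenizer(batch["text"], truncation = True, max_length = 2048)

tokenized_train = train_dataset.map(tokenize, batched = True, remove_columns = train_dataset.column_names)
tokenized_eval = eval_dataset.map(tokenize, batched = True, remove_columns = eval_dataset.column_names)

trainer = Trainer(
    model = model,
    args = training_args,   # the same TrainingArguments as above
    train_dataset = tokenized_train,
    eval_dataset = tokenized_eval,
    # causal-LM collator: pads each batch and copies input_ids into labels
    data_collator = DataCollatorForLanguageModeling(tokenizer, mlm = False),
)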

Could this be an issue with the packages/versions? The code is quite simple and I cannot figure out what else might cause this.

1 Like

Hello,

For anyone interested in the answer: this is the expected training speed for the given hardware and model size. What I ended up doing to improve the speed substantially:

  1. Lower the context length from 2048 to 512.
  2. Use mixed precision training.
  3. Use a quantized optimizer.

Step 1 had the biggest impact on training speed. I trained with the lowered context window for ~95% of the data I had and then increased it back to 2048 for the remaining 5%.
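
For reference, here is roughly how those three changes map onto the setup from the first post (a sketch: bf16 assumes an Ampere-class GPU such as the A100, and adamw_bnb_8bit is one example of a quantized 8-bit optimizer from bitsandbytes, not necessarily the exact one used):

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,
    dataset_text_field = "text",
    max_seq_length = 512,                  # 1. shorter context: attention and activation cost drop sharply

    args = TrainingArguments(
        gradient_accumulation_steps = 4,
        gradient_checkpointing = True,

        num_train_epochs = 3,
        learning_rate = 2e-4,
        per_device_train_batch_size = 4,
        per_device_eval_batch_size = 4,

        bf16 = True,                       # 2. mixed precision
        optim = "adamw_bnb_8bit",          # 3. quantized 8-bit AdamW (requires bitsandbytes)

        seed = 99,
        output_dir = "./checkpoints",

        save_strategy = "steps",
        eval_strategy = "steps",

        save_steps = 0.1,
        eval_steps = 0.1,
        logging_steps = 0.1,
        load_best_model_at_end = True
    ),
)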

1 Like

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.