SFTTrainer training very slow. Is this training speed expected?

Hello,

I am currently trying to do full fine-tuning of the ai-forever/mGPT model (1.3B parameters) on a single A100 GPU (40GB VRAM) in Google Colab. However, training is very slow: ~0.06 it/s.

I was wondering whether this is the expected training speed or whether there is some issue with my code. And if it is an issue, what would a possible fix be?

Here is my code:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

# assuming a default fp32 load of ai-forever/mGPT (1.3B parameters)
model = AutoModelForCausalLM.from_pretrained("ai-forever/mGPT")
tokenizer = AutoTokenizer.from_pretrained("ai-forever/mGPT")

# Lithuanian ("lt") subset of C4
dataset = load_dataset("allenai/c4", "lt")

train_dataset = dataset["train"]
eval_dataset = dataset["validation"]

# keep 10k training examples and 1k eval examples
train_dataset = train_dataset.take(10000)
eval_dataset = eval_dataset.take(1000)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,

    args = TrainingArguments(
        gradient_accumulation_steps = 4,
        gradient_checkpointing = True,

        num_train_epochs = 3,
        learning_rate = 2e-4,
        per_device_train_batch_size = 4,
        per_device_eval_batch_size = 4,

        seed = 99,
        output_dir = "./checkpoints",

        save_strategy = "steps",
        eval_strategy = "steps",

        save_steps = 0.1,
        eval_steps = 0.1,
        logging_steps = 0.1,
        load_best_model_at_end = True
    ),
)

trainer_stats = trainer.train()

And the trainer output (screenshot not reproduced here):

It says it will take ~10 hours to process the 10k examples from the C4 dataset. Is this normal?
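
For what it's worth, that estimate is consistent with the settings above; a back-of-the-envelope check (assuming each progress-bar iteration is one optimizer step of per_device_train_batch_size * gradient_accumulation_steps = 16 examples):

# back-of-the-envelope check of the ~10 h estimate
examples_per_step = 4 * 4                       # per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = 10_000 // examples_per_step   # 625 optimizer steps per epoch
total_steps = 3 * steps_per_epoch               # 1875 steps for num_train_epochs = 3
print(total_steps / 0.06 / 3600)                # ~8.7 h at 0.06 it/s, plus periodic eval -> roughly 10 h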

1 Like

> (1.3B parameters) using a single A100 GPU (40GB VRAM)

I’m not a training expert myself, but I think this is way too slow for those specs and that model size…
Likely causes would be the GPU not being used properly or a batch size that is too small, but normally you don’t have to configure anything special for the GPU to be picked up…
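
A quick way to rule out the first case (a minimal sketch, assuming the model object from the first post):

import torch

# confirm the weights actually sit on the GPU before training
print(torch.cuda.is_available())          # should be True on the Colab A100
print(next(model.parameters()).device)    # should be cuda:0, not cpu
print(next(model.parameters()).dtype)     # torch.float32 here means no mixed precision is being used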

These are the relevant package versions (the GPU-usage screenshot is not reproduced here):

Package                            Version
---------------------------------- -------------------
accelerate                         0.34.2
bitsandbytes                       0.44.1
datasets                           3.1.0
peft                               0.13.2
torch                              2.5.0+cu121
trl                                0.12.0

The model does seem to be loaded onto the GPU, but training is still slow for some reason.

I tried to use keep_in_memory=True when loading the dataset, but it did not help.

I also tried pre-tokenizing the dataset and using Trainer instead of SFTTrainer, but the performance was similar.
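
Roughly, that attempt looks like the sketch below (illustrative rather than the exact code; it assumes the same 2048-token limit and reuses the TrainingArguments from the first post as training_args):

from transformers import Trainer, DataCollatorForLanguageModeling

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token   # the collator needs a pad token

def tokenize(batch):
    # same 2048-token context as in the SFTTrainer setup
    return tokenizer(batch["text"], truncation = True, max_length = 2048)

tokenized_train = train_dataset.map(tokenize, batched = True, remove_columns = train_dataset.column_names)
tokenized_eval = eval_dataset.map(tokenize, batched = True, remove_columns = eval_dataset.column_names)

trainer = Trainer(
    model = model,
    args = training_args,   # the same TrainingArguments as above
    train_dataset = tokenized_train,
    eval_dataset = tokenized_eval,
    # causal-LM collator: pads each batch and copies input_ids into labels
    data_collator = DataCollatorForLanguageModeling(tokenizer, mlm = False),
)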

Could this be an issue with the packages/versions? The code is quite simple and I cannot figure out what else might cause this.

1 Like

Hello,

For anyone interested in the answer: this is the expected training speed for the given hardware and model size. What I ended up doing to improve the speed substantially:

  1. Lower the context length from 2048 to 512.
  2. Use mixed precision training.
  3. Use a quantized optimizer.

Step 1 had the biggest impact on training speed. I trained with the lowered context window for ~95% of the data I had and then increased it back to 2048 for the remaining 5%.
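
For reference, here is roughly how those three changes map onto the setup from the first post (a sketch: bf16 assumes an Ampere-class GPU such as the A100, and adamw_bnb_8bit is one example of a quantized 8-bit optimizer from bitsandbytes, not necessarily the exact one used):

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,
    dataset_text_field = "text",
    max_seq_length = 512,                  # 1. shorter context: attention and activation cost drop sharply

    args = TrainingArguments(
        gradient_accumulation_steps = 4,
        gradient_checkpointing = True,

        num_train_epochs = 3,
        learning_rate = 2e-4,
        per_device_train_batch_size = 4,
        per_device_eval_batch_size = 4,

        bf16 = True,                       # 2. mixed precision
        optim = "adamw_bnb_8bit",          # 3. quantized 8-bit AdamW (requires bitsandbytes)

        seed = 99,
        output_dir = "./checkpoints",

        save_strategy = "steps",
        eval_strategy = "steps",

        save_steps = 0.1,
        eval_steps = 0.1,
        logging_steps = 0.1,
        load_best_model_at_end = True
    ),
)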

1 Like

This topic was automatically closed 12 hours after the last reply. New replies are no longer allowed.