SFTTrainer training very slow. Is this training speed expected?

Hello,

I am currently trying to do full fine-tuning of the ai-forever/mGPT model (1.3B parameters) on a single A100 GPU (40 GB VRAM) on Google Colab. However, training is very slow: ~0.06 it/s.

Is this the expected training speed, or is there an issue with my code? And if there is an issue, what would a possible fix be?

Here is my code:

from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# model and tokenizer are loaded beforehand (AutoModelForCausalLM / AutoTokenizer)
dataset = load_dataset("allenai/c4", "lt")

train_dataset = dataset["train"]
eval_dataset = dataset["validation"]

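# keep a 10k/1k subset so the run stays manageable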
train_dataset = train_dataset.take(10000)
eval_dataset = eval_dataset.take(1000)

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,
    eval_dataset = eval_dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,

    args = TrainingArguments(
        gradient_accumulation_steps = 4,
        gradient_checkpointing = True,

        num_train_epochs = 3,
        learning_rate = 2e-4,
        per_device_train_batch_size = 4,
        per_device_eval_batch_size = 4,

        seed = 99,
        output_dir = "./checkpoints",

        save_strategy = "steps",
        eval_strategy = "steps",

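        # fractional step values are interpreted as a ratio of total training steps (here 10%)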
        save_steps = 0.1,
        eval_steps = 0.1,
        logging_steps = 0.1,
        load_best_model_at_end = True
    ),
)

trainer_stats = trainer.train()

And the trainer output:

It says it will take ~10 hours to finish training on the 10k examples from the C4 dataset. Is this normal?
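The estimate is at least self-consistent: with per_device_train_batch_size=4 and gradient_accumulation_steps=4, the effective batch size is 16, so 10,000 examples give 625 optimizer steps per epoch, i.e. 1,875 steps over 3 epochs. Assuming the it/s counter reports optimizer steps, 1,875 / 0.06 ≈ 31,250 s ≈ 8.7 hours, which matches the ETA once evaluation overhead is added. So the question is really whether ~0.06 it/s is reasonable on an A100.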


(1.3B parameters) using a single A100 GPU (40GB VRAM)

I’m not a training expert myself, but this seems far too slow for those specs and that model size.
Likely causes would be the model not actually running on the GPU, or a small effective batch size, but normally you don’t have to configure anything special to get proper GPU utilization.
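If you want to rule out the model sitting on the CPU, or training running in full fp32, a quick check along these lines (assuming model is your loaded mGPT instance) should tell you:

import torch

print(torch.cuda.is_available())        # should be True on a Colab A100 runtime
print(torch.cuda.get_device_name(0))    # the A100 should show up here
print(next(model.parameters()).device)  # should be cuda:0, not cpu
print(next(model.parameters()).dtype)   # torch.float32 means no fp16/bf16 is in use

If the dtype comes back as torch.float32, note that your TrainingArguments don’t set fp16 or bf16, so training runs in full fp32; on an A100, setting bf16=True is usually the first thing to try.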

These are the relevant package versions and a screenshot of GPU usage:

Package                            Version
---------------------------------- -------------------
accelerate                         0.34.2
bitsandbytes                       0.44.1
datasets                           3.1.0
peft                               0.13.2
torch                              2.5.0+cu121
trl                                0.12.0

It does seem to load the model onto the GPU, but for some reason it’s still slow.

I tried to use keep_in_memory=True when loading the dataset, but it did not help.
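Specifically, I passed it straight through load_dataset:

dataset = load_dataset("allenai/c4", "lt", keep_in_memory=True)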

I also tried pre-tokenizing the dataset and using Trainer instead of SFTTrainer, but the performance was similar.
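For reference, the pre-tokenization was roughly along these lines (a sketch, not the exact notebook cell):

def tokenize_fn(batch):
    # truncate to the same max length that SFTTrainer uses above
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized_train = train_dataset.map(tokenize_fn, batched=True, remove_columns=train_dataset.column_names)
tokenized_eval = eval_dataset.map(tokenize_fn, batched=True, remove_columns=eval_dataset.column_names)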

Could this be an issue with the package versions? The code is quite simple, and I cannot figure out what else might be causing this.
