Doubling SFTTrainer Speed on a Single GPU with DeepSpeed: Verification Report and Proposal to Update the Official Documentation

The official Hugging Face Transformers documentation states that

“if your model fits onto a single GPU and you have enough space to fit a small batch size, you don’t need to use DeepSpeed as it’ll only slow things down.”

However, our experiments have shown that even in a single GPU environment (e.g., Colab), incorporating DeepSpeed can lead to a 2.0× speedup.

In this topic, I share the steps and verification results that suggest the documentation could be updated to reflect these findings.

In our tests with the Trainer (or SFTTrainer from trl), adding about six lines of code to enable DeepSpeed roughly halved the training time and reduced VRAM usage, without significantly altering the existing pipeline.

  • For reproducibility and result verification, please refer to the Colab and Wandb links.
  • Environment

    • GPU: Google Colab (L4)
    • Model: Qwen2.5-7B-Instruct
    • Dataset: stanfordnlp/imdb (using 1,000 samples)
  • Comparison Conditions

    1. Without DeepSpeed (without installing mpi4py or deepspeed)
    2. Without DeepSpeed (with mpi4py and deepspeed installed)
    3. With DeepSpeed enabled (using ZeRO-1)

Code Example

Installation of Libraries

!pip install -q huggingface_hub==0.29.1
!pip install -q transformers==4.49.0
!pip install -q bitsandbytes==0.45.3
!pip install -q peft==0.14.0
!pip install -q accelerate==1.4.0
!pip install -q datasets==3.3.2
!pip install -q trl==0.15.2
!pip install -q mpi4py==4.0.3  # Comment out in exp001
!pip install -q deepspeed==0.16.4  # Comment out in exp001
!pip install -q flash-attn==2.7.4.post1 --no-build-isolation
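
As an optional sanity check (not part of the original notebooks, just a suggested extra cell), the runtime and key library versions can be confirmed right after the installs:

# Optional: confirm the GPU and key library versions before training
import torch
import transformers
import trl

print(torch.cuda.get_device_name(0))       # expected: NVIDIA L4 on the Colab runtime used here
print("transformers", transformers.__version__)
print("trl", trl.__version__)

try:
    import deepspeed
    print("deepspeed", deepspeed.__version__)  # present only in exp002/exp003
except ImportError:
    print("deepspeed not installed (exp001)")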

Without DeepSpeed (exp001 / exp002)

import os

import wandb
import torch
from datasets import load_dataset
from huggingface_hub import snapshot_download
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# Wandb configuration
wandb.login()
os.environ["WANDB_PROJECT"] = "1gpu-deepspeed"

# Download model
model_name = "Qwen/Qwen2.5-7B-Instruct"
snapshot_download(repo_id=model_name, local_dir_use_symlinks=False, revision="main")

# Load dataset
dataset = load_dataset("stanfordnlp/imdb", split="train")
dataset = dataset.select(range(1000))

run_name = "exp001_baseline"  # Alternatively "exp002_only_pip_install_deepspeed"

# Training configuration
training_args = SFTConfig(
    max_seq_length=512,
    run_name=run_name,
    output_dir="/tmp",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    optim='adamw_torch',
    logging_steps=1,
    learning_rate=1e-4,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    seed=1024,
    bf16=True,
)

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules="all-linear",
    task_type=TaskType.CAUSAL_LM,
)

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
model = get_peft_model(model, lora_config)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    args=training_args,
)

# Start training
trainer.train()

wandb.finish()

With DeepSpeed Enabled (exp003)

import os

import wandb
import torch
from datasets import load_dataset
from huggingface_hub import snapshot_download
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# Set environment variables for DeepSpeed distributed training (for single GPU)
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "9994"
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"

# Wandb configuration
wandb.login()
os.environ["WANDB_PROJECT"] = "1gpu-deepspeed"

# Download model
model_name = "Qwen/Qwen2.5-7B-Instruct"
snapshot_download(repo_id=model_name, local_dir_use_symlinks=False, revision="main")

# Load dataset
dataset = load_dataset("stanfordnlp/imdb", split="train")
dataset = dataset.select(range(1000))

run_name = "exp003_ZeRO-1"

# Training configuration (with deepspeed argument)
training_args = SFTConfig(
    max_seq_length=512,
    run_name=run_name,
    output_dir="/tmp",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    optim='adamw_torch',
    logging_steps=1,
    learning_rate=1e-4,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    seed=1024,
    bf16=True,
    deepspeed="ds_config_zero1.json",  # Specify the DeepSpeed configuration file
)

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules="all-linear",
    task_type=TaskType.CAUSAL_LM,
)

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
model = get_peft_model(model, lora_config)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    args=training_args,
)

# Start training
trainer.train()

wandb.finish()

ds_config_zero1.json

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        }
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
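
The "auto" entries are resolved by the Trainer's DeepSpeed integration from the matching SFTConfig fields (bf16, gradient accumulation, batch sizes, and so on), so the file does not need to be edited per run. It only has to exist at the path passed to deepspeed=...; one convenient way to create it from the notebook (a convenience sketch, not shown in the original post) is:

# Write ds_config_zero1.json from the notebook so SFTConfig can find it
import json

ds_config = {
    "fp16": {"enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000,
             "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1},
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {"device": "none", "pin_memory": True},
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": False,
}

with open("ds_config_zero1.json", "w") as f:
    json.dump(ds_config, f, indent=4)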

Code Differences

  • Environment variables for single-GPU DeepSpeed are added.
  • The training configuration now includes deepspeed="ds_config_zero1.json" to point at the DeepSpeed settings (see the consolidated sketch after this list).
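
Put together, the entire exp001 → exp003 change is roughly the six lines below (a consolidated view of the snippets already shown above, not additional code):

# Added for exp003 only: DeepSpeed expects the usual launcher-provided
# distributed variables, so on a single GPU in a notebook they are set manually.
import os

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "9994"
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"

# The only other change is the extra argument in the training configuration:
# training_args = SFTConfig(..., deepspeed="ds_config_zero1.json")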

Performance Comparison

(I'm a new user, so I can only post two URLs; please add an "h" to the start of each Colab URL.)

Experiment | DeepSpeed | Training Time (sec) | Max VRAM (GB) | Colab URL
--- | --- | --- | --- | ---
exp001 | Not used | 1384 | 11.4 | ttps://colab.research.google.com/drive/1BcNi9NZSICqk0cLlDlkWaS_JwyqgX9oa
exp002 | Not used (only installed) | 1389 | 11.4 | ttps://colab.research.google.com/drive/1tnIrsfRiaxDiwzJHdZ_AJw1Rv9j8KKwg
exp003 | ZeRO-1 | 692 | 9.4 | ttps://colab.research.google.com/drive/1BR9B4nhACP1iRHJjXiYWnqogA78Wi3uq

Wandb logs (exp001–003)
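
The timing and memory numbers are taken from the linked runs. As a rough cross-check (an assumption about methodology, not something the original notebooks show), peak usage can also be queried from PyTorch right after trainer.train(); note this reports memory allocated by PyTorch, which can differ slightly from what nvidia-smi or the Wandb system metrics show.

# Rough cross-check of peak VRAM, run immediately after trainer.train()
import torch

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Max VRAM allocated: {peak_gb:.1f} GB")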


It’s true that it’s not recommended for single GPUs…
It seems you can report issues with the Hub and the documentation in the following issue.