Doubling SFTTrainer Speed on a Single GPU with DeepSpeed: Verification Report and Proposal to Update the Official Documentation

The official Hugging Face Transformers documentation states that

“if your model fits onto a single GPU and you have enough space to fit a small batch size, you don’t need to use DeepSpeed as it’ll only slow things down.”

However, our experiments have shown that even in a single GPU environment (e.g., Colab), incorporating DeepSpeed can lead to a 2.0× speedup.

In this topic, I share the steps and verification results that suggest the documentation could be updated to reflect these findings.

In our tests with the Trainer (or SFTTrainer from trl), adding about six lines of code to enable DeepSpeed roughly halved the training time and reduced VRAM usage, without significantly altering the existing pipeline.

  • For reproducibility and result verification, please refer to the Colab and Wandb links.
  • Environment

    • GPU: Google Colab (L4)
    • Model: Qwen2.5-7B-Instruct
    • Dataset: stanfordnlp/imdb (using 1,000 samples)
  • Comparison Conditions

    1. Without DeepSpeed (without installing mpi4py or deepspeed)
    2. Without DeepSpeed (with mpi4py and deepspeed installed)
    3. With DeepSpeed enabled (using ZeRO-1)

Code Example

Installation of Libraries

!pip install -q huggingface_hub==0.29.1
!pip install -q transformers==4.49.0
!pip install -q bitsandbytes==0.45.3
!pip install -q peft==0.14.0
!pip install -q accelerate==1.4.0
!pip install -q datasets==3.3.2
!pip install -q trl==0.15.2
!pip install -q mpi4py==4.0.3  # Comment out in exp001
!pip install -q deepspeed==0.16.4  # Comment out in exp001
!pip install -q flash-attn==2.7.4.post1 --no-build-isolation
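
As an optional sanity check (not part of the original notebooks, just a suggested extra cell), the runtime and key library versions can be confirmed right after the installs:

# Optional: confirm the GPU and key library versions before training
import torch
import transformers
import trl

print(torch.cuda.get_device_name(0))       # expected: NVIDIA L4 on the Colab runtime used here
print("transformers", transformers.__version__)
print("trl", trl.__version__)

try:
    import deepspeed
    print("deepspeed", deepspeed.__version__)  # present only in exp002/exp003
except ImportError:
    print("deepspeed not installed (exp001)")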

Without DeepSpeed (exp001 / exp002)

import os

import wandb
import torch
from datasets import load_dataset
from huggingface_hub import snapshot_download
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# Wandb configuration
wandb.login()
os.environ["WANDB_PROJECT"] = "1gpu-deepspeed"

# Download model
model_name = "Qwen/Qwen2.5-7B-Instruct"
snapshot_download(repo_id=model_name, local_dir_use_symlinks=False, revision="main")

# Load dataset
dataset = load_dataset("stanfordnlp/imdb", split="train")
dataset = dataset.select(range(1000))

run_name = "exp001_baseline"  # Alternatively "exp002_only_pip_install_deepspeed"

# Training configuration
training_args = SFTConfig(
    max_seq_length=512,
    run_name=run_name,
    output_dir="/tmp",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    optim='adamw_torch',
    logging_steps=1,
    learning_rate=1e-4,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    seed=1024,
    bf16=True,
)

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules="all-linear",
    task_type=TaskType.CAUSAL_LM,
)

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
model = get_peft_model(model, lora_config)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    args=training_args,
)

# Start training
trainer.train()

wandb.finish()

With DeepSpeed Enabled (exp003)

import os

import wandb
import torch
from datasets import load_dataset
from huggingface_hub import snapshot_download
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# Set environment variables for DeepSpeed distributed training (for single GPU)
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "9994"
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"

# Wandb configuration
wandb.login()
os.environ["WANDB_PROJECT"] = "1gpu-deepspeed"

# Download model
model_name = "Qwen/Qwen2.5-7B-Instruct"
snapshot_download(repo_id=model_name, local_dir_use_symlinks=False, revision="main")

# Load dataset
dataset = load_dataset("stanfordnlp/imdb", split="train")
dataset = dataset.select(range(1000))

run_name = "exp003_ZeRO-1"

# Training configuration (with deepspeed argument)
training_args = SFTConfig(
    max_seq_length=512,
    run_name=run_name,
    output_dir="/tmp",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    optim='adamw_torch',
    logging_steps=1,
    learning_rate=1e-4,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    seed=1024,
    bf16=True,
    deepspeed="ds_config_zero1.json",  # Specify the DeepSpeed configuration file
)

# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules="all-linear",
    task_type=TaskType.CAUSAL_LM,
)

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2"
)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
model = get_peft_model(model, lora_config)

trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    args=training_args,
)

# Start training
trainer.train()

wandb.finish()

ds_config_zero1.json

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {
            "device": "none",
            "pin_memory": true
        }
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
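
The "auto" entries are resolved by the Trainer's DeepSpeed integration from the matching SFTConfig fields (bf16, gradient accumulation, batch sizes, and so on), so the file does not need to be edited per run. It only has to exist at the path passed to deepspeed=...; one convenient way to create it from the notebook (a convenience sketch, not shown in the original post) is:

# Write ds_config_zero1.json from the notebook so SFTConfig can find it
import json

ds_config = {
    "fp16": {"enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000,
             "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1},
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {"device": "none", "pin_memory": True},
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": False,
}

with open("ds_config_zero1.json", "w") as f:
    json.dump(ds_config, f, indent=4)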

Code Differences

  • Environment variables for single-GPU DeepSpeed are added.
  • The training configuration now includes deepspeed="ds_config_zero1.json" to point at the DeepSpeed settings (see the consolidated sketch after this list).
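
Put together, the entire exp001 → exp003 change is roughly the six lines below (a consolidated view of the snippets already shown above, not additional code):

# Added for exp003 only: DeepSpeed expects the usual launcher-provided
# distributed variables, so on a single GPU in a notebook they are set manually.
import os

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "9994"
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"

# The only other change is the extra argument in the training configuration:
# training_args = SFTConfig(..., deepspeed="ds_config_zero1.json")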

Performance Comparison

(I'm a new user, so I can only post two URLs; please add an "h" to the start of each Colab URL.)

Experiment | DeepSpeed | Training Time (sec) | Max VRAM (GB) | Colab URL
--- | --- | --- | --- | ---
exp001 | Not used | 1384 | 11.4 | ttps://colab.research.google.com/drive/1BcNi9NZSICqk0cLlDlkWaS_JwyqgX9oa
exp002 | Not used (only installed) | 1389 | 11.4 | ttps://colab.research.google.com/drive/1tnIrsfRiaxDiwzJHdZ_AJw1Rv9j8KKwg
exp003 | ZeRO-1 | 692 | 9.4 | ttps://colab.research.google.com/drive/1BR9B4nhACP1iRHJjXiYWnqogA78Wi3uq

Wandb logs (exp001–003)
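
The timing and memory numbers are taken from the linked runs. As a rough cross-check (an assumption about methodology, not something the original notebooks show), peak usage can also be queried from PyTorch right after trainer.train(); note this reports memory allocated by PyTorch, which can differ slightly from what nvidia-smi or the Wandb system metrics show.

# Rough cross-check of peak VRAM, run immediately after trainer.train()
import torch

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Max VRAM allocated: {peak_gb:.1f} GB")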


It’s true that it’s not recommended for single GPUs…
It seems you can report issues with the Hub and the documentation in the following issue.