The official Hugging Face Transformers documentation states that
“if your model fits onto a single GPU and you have enough space to fit a small batch size, you don’t need to use DeepSpeed as it’ll only slow things down.”
However, my experiments show that even in a single-GPU environment (e.g., Colab), enabling DeepSpeed can yield a 2.0× speedup.
In this topic, I share the steps and verification results, which suggest the documentation could be updated to reflect these findings.
In my tests with the Trainer (SFTTrainer from trl), adding about six lines of code to enable DeepSpeed roughly halved training time and also reduced peak VRAM usage, without significantly altering the existing pipeline.
- For reproducibility and result verification, please refer to the Colab and Wandb links below.
Environment
- GPU: Google Colab (L4)
- Model: Qwen2.5-7B-Instruct
- Dataset: stanfordnlp/imdb (using 1,000 samples)
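As a sanity check before running the experiments, the runtime can be confirmed to match this environment. This is a small snippet of my own, not part of the original notebooks:
import torch

# Print the GPU model and total VRAM of the current runtime
# (an L4 runtime should report roughly 24 GB)
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB")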
Comparison Conditions
- exp001: Without DeepSpeed (mpi4py and deepspeed not installed)
- exp002: Without DeepSpeed (mpi4py and deepspeed installed but unused)
- exp003: With DeepSpeed enabled (ZeRO-1)
Code Example
Installation of Libraries
!pip install -q huggingface_hub==0.29.1
!pip install -q transformers==4.49.0
!pip install -q bitsandbytes==0.45.3
!pip install -q peft==0.14.0
!pip install -q accelerate==1.4.0
!pip install -q datasets==3.3.2
!pip install -q trl==0.15.2
!pip install -q mpi4py==4.0.3 # Comment out in exp001
!pip install -q deepspeed==0.16.4 # Comment out in exp001
!pip install -q flash-attn==2.7.4.post1 --no-build-isolation
Without DeepSpeed (exp001 / exp002)
import os
import wandb
import torch
from datasets import load_dataset
from huggingface_hub import snapshot_download
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer
# Wandb configuration
wandb.login()
os.environ["WANDB_PROJECT"] = "1gpu-deepspeed"
# Download model
model_name = "Qwen/Qwen2.5-7B-Instruct"
snapshot_download(repo_id=model_name, local_dir_use_symlinks=False, revision="main")
# Load dataset
dataset = load_dataset("stanfordnlp/imdb", split="train")
dataset = dataset.select(range(1000))
run_name = "exp001_baseline" # Alternatively "exp002_only_pip_install_deepspeed"
# Training configuration
training_args = SFTConfig(
    max_seq_length=512,
    run_name=run_name,
    output_dir="/tmp",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    optim="adamw_torch",
    logging_steps=1,
    learning_rate=1e-4,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    seed=1024,
    bf16=True,
)
# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules="all-linear",
    task_type=TaskType.CAUSAL_LM,
)
# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
model = get_peft_model(model, lora_config)
trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    args=training_args,
)
# Start training
trainer.train()
wandb.finish()
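As an aside, the training time and peak VRAM reported in the comparison below can also be checked directly in the notebook by wrapping the trainer.train() call above. This is a small sketch of my own (not part of the original runs), assuming a single CUDA device; note that torch.cuda.max_memory_allocated() counts allocated tensors, which can differ slightly from what nvidia-smi reports.
import time

# Reset the peak-memory counter so the statistic covers only this run
torch.cuda.reset_peak_memory_stats()

start = time.time()
trainer.train()
elapsed = time.time() - start

max_vram_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Training time: {elapsed:.0f} s, peak VRAM allocated: {max_vram_gb:.1f} GB")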
With DeepSpeed Enabled (exp003)
import os
import wandb
import torch
from datasets import load_dataset
from huggingface_hub import snapshot_download
from peft import LoraConfig, get_peft_model, TaskType, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer
# Set environment variables for DeepSpeed distributed training (for single GPU)
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "9994"
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"
# Wandb configuration
wandb.login()
os.environ["WANDB_PROJECT"] = "1gpu-deepspeed"
# Download model
model_name = "Qwen/Qwen2.5-7B-Instruct"
snapshot_download(repo_id=model_name, local_dir_use_symlinks=False, revision="main")
# Load dataset
dataset = load_dataset("stanfordnlp/imdb", split="train")
dataset = dataset.select(range(1000))
run_name = "exp003_ZeRO-1"
# Training configuration (with deepspeed argument)
training_args = SFTConfig(
    max_seq_length=512,
    run_name=run_name,
    output_dir="/tmp",
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    optim="adamw_torch",
    logging_steps=1,
    learning_rate=1e-4,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    seed=1024,
    bf16=True,
    deepspeed="ds_config_zero1.json",  # Specify the DeepSpeed configuration file
)
# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    target_modules="all-linear",
    task_type=TaskType.CAUSAL_LM,
)
# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
model = get_peft_model(model, lora_config)
trainer = SFTTrainer(
    model,
    train_dataset=dataset,
    args=training_args,
)
# Start training
trainer.train()
wandb.finish()
ds_config_zero1.json
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 1,
    "offload_optimizer": {
      "device": "none",
      "pin_memory": true
    }
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
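The deepspeed argument above expects this file to exist in the working directory before training starts. In Colab, one convenient way to create it (a convenience snippet of my own, not from the original notebooks) is to dump an equivalent dict from the same script. The sketch below writes a trimmed version of the config above (the fp16 block is inert here since bf16=True); the Trainer also accepts such a dict directly via deepspeed=ds_config, which skips the file entirely.
import json

# ZeRO-1 settings matching the JSON above; "auto" values are resolved by the
# Trainer from the corresponding SFTConfig arguments at runtime.
ds_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 1,
        "offload_optimizer": {"device": "none", "pin_memory": True},
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("ds_config_zero1.json", "w") as f:
    json.dump(ds_config, f, indent=2)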
Code Differences
- Environment variables (MASTER_ADDR, MASTER_PORT, RANK, LOCAL_RANK, WORLD_SIZE) are set so DeepSpeed can initialize its distributed backend on a single GPU.
- The training configuration additionally passes deepspeed="ds_config_zero1.json" to point the Trainer at the DeepSpeed settings; the full diff is summarized below.
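For reference, the complete difference between the baseline scripts (exp001/exp002) and the DeepSpeed run (exp003), taken verbatim from the code above, is just these lines plus the ds_config_zero1.json file:
# Added for exp003: minimal single-process "distributed" settings so DeepSpeed can initialize
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "9994"
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"

# Added to SFTConfig
deepspeed="ds_config_zero1.json",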
Performance Comparison
(I'm a new user, so I can only post 2 URLs; please add an “h” to the start of each Colab URL.)
| Experiment | DeepSpeed | Training Time (sec) | Max VRAM (GB) | Colab URL |
| --- | --- | --- | --- | --- |
| exp001 | Not used | 1384 | 11.4 | ttps://colab.research.google.com/drive/1BcNi9NZSICqk0cLlDlkWaS_JwyqgX9oa |
| exp002 | Not used (only installed) | 1389 | 11.4 | ttps://colab.research.google.com/drive/1tnIrsfRiaxDiwzJHdZ_AJw1Rv9j8KKwg |
| exp003 | ZeRO-1 | 692 | 9.4 | ttps://colab.research.google.com/drive/1BR9B4nhACP1iRHJjXiYWnqogA78Wi3uq |
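Reading these numbers off the table: enabling ZeRO-1 cut training time from 1384 s to 692 s, i.e. a 1384 / 692 ≈ 2.0× speedup, and reduced peak VRAM from 11.4 GB to 9.4 GB (about an 18% reduction), while exp002 shows that merely installing mpi4py and deepspeed without enabling them has no measurable effect (1389 s vs. 1384 s).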