Question about using DeepSpeed ZeRO-3 AMP on a simple PyTorch example

fp16 and fp32 run completely smoothly, but bf16 does not work: it raises an error saying the input type (float) is not the same as the bias type (bfloat16).
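
For reference, the same dtype mismatch can be reproduced in plain PyTorch without DeepSpeed, by feeding an fp32 tensor to a bf16 layer (a minimal sketch to illustrate the error message, not the actual failing path inside ZeRO-3):

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 6, 5).to(torch.bfloat16)  # weights and bias cast to bf16
x = torch.randn(1, 3, 32, 32)                 # fp32 input, like the output of ToTensor()
conv(x)  # raises a RuntimeError: the input dtype does not match the weight/bias dtype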

The code is shown below:

import time
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from accelerate import Accelerator


seed_value = 42
torch.manual_seed(seed_value)

accelerator = Accelerator()
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

batch_size = 512

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')


class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1) # flatten all dimensions except batch
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


model = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

model, optimizer, trainloader = accelerator.prepare(model, optimizer, trainloader)
accelerator.print(model.__class__.__name__)

for epoch in range(20):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        start_time = time.time()
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data
        with accelerator.accumulate(model):
            # zero the parameter gradients
            optimizer.zero_grad()
            # forward + backward + optimize
            outputs = model(inputs)
            loss = criterion(outputs, labels)

            accelerator.backward(loss)
            accelerator.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

        # print statistics
        running_loss += loss.item()
        if (i+1) % 10 == 0:
            end_time = time.time()
            time_spent = end_time - start_time
            # print every 10 mini-batches
            accelerator.print(f'[Epoch{epoch + 1}, Step{i + 1:5d}] loss: {running_loss / 10:.5f}, time_spent: {time_spent:.5f}')
            running_loss = 0.0

accelerator.print('Finished Training')
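
To illustrate where the mismatch seems to come from, here is a quick dtype check of what meets in the forward pass (a sketch; I am assuming the ZeRO-3 wrapped model still reports its parameter dtype the usual way):

inputs, _ = next(iter(trainloader))
accelerator.print('param dtype:', next(model.parameters()).dtype)  # presumably bfloat16 when mixed_precision is bf16
accelerator.print('input dtype:', inputs.dtype)                    # float32 from ToTensor()/Normalize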

The accelerate config is shown below:

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
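
The bf16 run uses the same config, with only the mixed-precision setting changed:

mixed_precision: bf16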