### System Info
```Shell
- `Accelerate` version: 1.2.0
- Platform: Linux-5.15.0…-116-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /home/zhoust/miniconda3/envs/toolkit/bin/accelerate
- Python version: 3.12.2
- Numpy version: 2.2.0
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 1007.50 GB
- GPU type: NVIDIA RTX A6000
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- gpu_ids: 0,1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
```
### Information
- [ ] The official example scripts
- [x] My own modified scripts
### Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [x] My own task or dataset (give details below)
### Reproduction
```python
from accelerate import Accelerator
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler
import torch
import numpy as np


# Dummy dataset
class DummyDataset(Dataset):
    def __init__(self, length=1000):
        self.data = np.arange(length)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return {
            "input": torch.tensor(self.data[idx], dtype=torch.float32)
        }


# Configs
BATCH_SIZE = 32
TOTAL_SAMPLES = 128
GRAD_ACC_STEPS = 2

# Initialize Accelerator
accelerator = Accelerator(gradient_accumulation_steps=GRAD_ACC_STEPS)

# Dataset and Dataloader
dataset = DummyDataset(length=TOTAL_SAMPLES)
dataloader = DataLoader(
    dataset,
    batch_size=BATCH_SIZE,
    sampler=DistributedSampler(dataset, num_replicas=accelerator.num_processes, shuffle=True),
)
accelerator.print(f"Dataloader length (this process): {len(dataloader)}")

# Prepare with accelerator
dataloader = accelerator.prepare(dataloader)

# Info logging
accelerator.print("=" * 50)
accelerator.print(f"Device: {accelerator.device}")
accelerator.print(f"Mixed Precision: {accelerator.mixed_precision}")
accelerator.print(f"Num processes (world size): {accelerator.num_processes}")
accelerator.print(f"Gradient accumulation steps: {GRAD_ACC_STEPS}")
accelerator.print(f"Batch size per process: {BATCH_SIZE}")
accelerator.print(f"Total dataset size: {TOTAL_SAMPLES}")
accelerator.print(f"Dataloader length (this process): {len(dataloader)}")
accelerator.print(f"Global steps per epoch: {len(dataloader) * accelerator.num_processes}")
accelerator.print(f"Effective batch size: {BATCH_SIZE * accelerator.num_processes * GRAD_ACC_STEPS}")
accelerator.print("=" * 50)

# Iterate through dataloader
for i, batch in enumerate(dataloader):
    with accelerator.accumulate(None):
        # Simulate training step
        input_data = batch["input"]
        print(f"[{accelerator.device}] {input_data.detach().cpu().numpy().tolist()}")
        accelerator.print(
            f"[Batch {i}] batch['input'].shape: {batch['input'].shape}, dataloader length: {len(dataloader)}"
        )
```
### Expected behavior
All 128 samples should be iterated once per epoch across the two processes. Instead, running the script with the config above (2 processes) produces the following output:
```
Dataloader length (this process): 2
==================================================
Device: cuda:0
Mixed Precision: bf16
Num processes (world size): 2
Gradient accumulation steps: 2
Batch size per process: 32
Total dataset size: 128
Dataloader length (this process): 1
Global steps per epoch: 2
Effective batch size: 128
==================================================
[cuda:0] [44.0, 121.0, 31.0, 71.0, 105.0, 15.0, 48.0, 30.0, 104.0, 72.0, 118.0, 53.0, 18.0, 33.0, 10.0, 101.0, 115.0, 35.0, 74.0, 37.0, 114.0, 41.0, 99.0, 26.0, 110.0, 66.0, 65.0, 23.0, 79.0, 36.0, 63.0, 52.0]
[Batch 0] batch['input'].shape: torch.Size([32]), dataloader length: 1
[cuda:1] [56.0, 46.0, 17.0, 1.0, 54.0, 87.0, 73.0, 112.0, 55.0, 19.0, 103.0, 59.0, 20.0, 108.0, 96.0, 78.0, 61.0, 64.0, 113.0, 89.0, 91.0, 62.0, 42.0, 24.0, 93.0, 100.0, 80.0, 85.0, 43.0, 9.0, 7.0, 83.0]
```
### Issues
- Because the `DataLoader` is created with a `DistributedSampler`, the data is already split into two shards (one per process), so each process starts with 2 batches of 32.
- After `accelerator.prepare`, the dataloader is sharded again and its length per process drops from 2 to 1.
- As a result, only **_half of the samples_** (64 of 128) are iterated in one pass over the dataloader, as shown in the last three lines of the output; a sketch of the workaround I am using is below.
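For reference, here is a minimal sketch of the workaround I am using for now, assuming the intended usage is to pass a plain `DataLoader` (no manual `DistributedSampler`) and let `accelerator.prepare` insert the per-process sharding itself. It uses a `TensorDataset` in place of `DummyDataset` to keep the example short:

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator(gradient_accumulation_steps=2)

# 128 scalar samples, batch size 32, 2 processes -> 2 batches per process.
dataset = TensorDataset(torch.arange(128, dtype=torch.float32))

# No DistributedSampler here; accelerator.prepare() is expected to handle
# the per-process sharding on its own.
dataloader = accelerator.prepare(DataLoader(dataset, batch_size=32, shuffle=True))

accelerator.print(f"Dataloader length (this process): {len(dataloader)}")
for i, (batch,) in enumerate(dataloader):
    print(f"[{accelerator.device}] batch {i}: {batch.shape}")
```

With this variant each process should see 2 batches per epoch, so all 128 samples are covered once. It would still be good to know whether sharding a dataloader that already carries a `DistributedSampler` a second time is intended behavior.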