### System Info
```Shell
- `Accelerate` version: 1.2.0
- Platform: Linux-5.15.0…-116-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /home/zhoust/miniconda3/envs/toolkit/bin/accelerate
- Python version: 3.12.2
- Numpy version: 2.2.0
- PyTorch version (GPU?): 2.5.1+cu124 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 1007.50 GB
- GPU type: NVIDIA RTX A6000
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: MULTI_GPU
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 2
- machine_rank: 0
- num_machines: 1
- gpu_ids: 0,1
- rdzv_backend: static
- same_network: True
- main_training_function: main
- enable_cpu_affinity: False
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
```
### Information
- [ ] The official example scripts
- [x] My own modified scripts
### Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [x] My own task or dataset (give details below)
### Reproduction
```python
from accelerate import Accelerator
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler
import torch
import numpy as np


# Dummy dataset
class DummyDataset(Dataset):
    def __init__(self, length=1000):
        self.data = np.arange(length)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return {
            "input": torch.tensor(self.data[idx], dtype=torch.float32)
        }


# Configs
BATCH_SIZE = 32
TOTAL_SAMPLES = 128
GRAD_ACC_STEPS = 2

# Initialize Accelerator
accelerator = Accelerator(gradient_accumulation_steps=GRAD_ACC_STEPS)

# Dataset and Dataloader
dataset = DummyDataset(length=TOTAL_SAMPLES)
dataloader = DataLoader(
    dataset,
    batch_size=BATCH_SIZE,
    sampler=DistributedSampler(dataset, num_replicas=accelerator.num_processes, shuffle=True),
)
accelerator.print(f"Dataloader length (this process): {len(dataloader)}")

# Prepare with accelerator
dataloader = accelerator.prepare(dataloader)

# Info logging
accelerator.print("=" * 50)
accelerator.print(f"Device: {accelerator.device}")
accelerator.print(f"Mixed Precision: {accelerator.mixed_precision}")
accelerator.print(f"Num processes (world size): {accelerator.num_processes}")
accelerator.print(f"Gradient accumulation steps: {GRAD_ACC_STEPS}")
accelerator.print(f"Batch size per process: {BATCH_SIZE}")
accelerator.print(f"Total dataset size: {TOTAL_SAMPLES}")
accelerator.print(f"Dataloader length (this process): {len(dataloader)}")
accelerator.print(f"Global steps per epoch: {len(dataloader) * accelerator.num_processes}")
accelerator.print(f"Effective batch size: {BATCH_SIZE * accelerator.num_processes * GRAD_ACC_STEPS}")
accelerator.print("=" * 50)

# Iterate through dataloader
for i, batch in enumerate(dataloader):
    with accelerator.accumulate(None):
        # Simulate training step
        input_data = batch["input"]
        print(f"[{accelerator.device}] {input_data.detach().cpu().numpy().tolist()}")
        accelerator.print(
            f"[Batch {i}] batch['input'].shape: {batch['input'].shape}, dataloader length: {len(dataloader)}"
        )
```
### Expected behavior
All 128 samples should be iterated once per epoch across the two processes. Instead, running the script with the config above (2 processes) produces the following output:
```
Dataloader length (this process): 2
==================================================
Device: cuda:0
Mixed Precision: bf16
Num processes (world size): 2
Gradient accumulation steps: 2
Batch size per process: 32
Total dataset size: 128
Dataloader length (this process): 1
Global steps per epoch: 2
Effective batch size: 128
==================================================
[cuda:0] [44.0, 121.0, 31.0, 71.0, 105.0, 15.0, 48.0, 30.0, 104.0, 72.0, 118.0, 53.0, 18.0, 33.0, 10.0, 101.0, 115.0, 35.0, 74.0, 37.0, 114.0, 41.0, 99.0, 26.0, 110.0, 66.0, 65.0, 23.0, 79.0, 36.0, 63.0, 52.0]
[Batch 0] batch['input'].shape: torch.Size([32]), dataloader length: 1
[cuda:1] [56.0, 46.0, 17.0, 1.0, 54.0, 87.0, 73.0, 112.0, 55.0, 19.0, 103.0, 59.0, 20.0, 108.0, 96.0, 78.0, 61.0, 64.0, 113.0, 89.0, 91.0, 62.0, 42.0, 24.0, 93.0, 100.0, 80.0, 85.0, 43.0, 9.0, 7.0, 83.0]
```
### Issues
- Because the `DataLoader` is created with a `DistributedSampler`, the data is already split into two shards (one per process), so each process starts with 2 batches of 32.
- After `accelerator.prepare`, the dataloader is sharded again and its length per process drops from 2 to 1.
- As a result, only **_half of the samples_** (64 of 128) are iterated in one pass over the dataloader, as shown in the last three lines of the output; a sketch of the workaround I am using is below.
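For reference, here is a minimal sketch of the workaround I am using for now, assuming the intended usage is to pass a plain `DataLoader` (no manual `DistributedSampler`) and let `accelerator.prepare` insert the per-process sharding itself. It uses a `TensorDataset` in place of `DummyDataset` to keep the example short:

```python
import torch
from accelerate import Accelerator
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator(gradient_accumulation_steps=2)

# 128 scalar samples, batch size 32, 2 processes -> 2 batches per process.
dataset = TensorDataset(torch.arange(128, dtype=torch.float32))

# No DistributedSampler here; accelerator.prepare() is expected to handle
# the per-process sharding on its own.
dataloader = accelerator.prepare(DataLoader(dataset, batch_size=32, shuffle=True))

accelerator.print(f"Dataloader length (this process): {len(dataloader)}")
for i, (batch,) in enumerate(dataloader):
    print(f"[{accelerator.device}] batch {i}: {batch.shape}")
```

With this variant each process should see 2 batches per epoch, so all 128 samples are covered once. It would still be good to know whether sharding a dataloader that already carries a `DistributedSampler` a second time is intended behavior.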