How to use split_dataset_by_node and shuffle on an iterable dataset

I’m currently training a model on a single node with multiple GPUs, using an IterableDataset in streaming mode. However, after one epoch the training hangs and then crashes with a timeout. After extensive googling, I’ve come to believe the problem is the way I’m using split_dataset_by_node and shuffle: the GPUs seem to be getting a different number of batches.

Here’s my full dataloading/dataset code.

def load_train_objs(dataset_path):
    dataset_path_dict = {
        "train": train_files,
        "val": val_files,
    }

    dataset = load_dataset("parquet", data_files=dataset_path_dict, streaming=True)
    dataset = dataset.shuffle(seed=seed)
    ...
    return dataset

def prepare_data_loader(dataset, batch_size, rank, world_size):
    train_ds = dataset["train"]
    train_ds = split_dataset_by_node(train_ds, rank=rank, world_size=world_size)

    val_ds = dataset["val"]
    val_ds = split_dataset_by_node(val_ds, rank=rank, world_size=world_size)

    train_dl = DataLoader(
        train_ds,
        batch_size,
        num_workers=8,
        drop_last=True,
        collate_fn=collate_fn,
    )

    val_dl = DataLoader(val_ds, batch_size, drop_last=False, collate_fn=collate_fn)

    return train_dl, val_dl

I’m not sure at what point I need to call shuffle. Should it stay where it currently is, should it come after `train_ds = dataset["train"]`, or even after `train_ds = split_dataset_by_node(train_ds, rank=rank, world_size=world_size)`?

Any help would be greatly appreciated :slight_smile:

2 Likes

Hi! `split_dataset_by_node` may not return datasets of the same size, so your training may time out while the other ranks wait once one node runs out of examples.
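
One way to avoid the hang is to sync a per-step "still has data" flag across ranks, so every rank stops as soon as the first one runs out of batches. This is a minimal sketch (not necessarily what you need verbatim), assuming `torch.distributed` is already initialized and `model`, `optimizer`, `train_dl` and `device` come from your own setup:

```python
import torch
import torch.distributed as dist

def train_one_epoch(model, optimizer, train_dl, device):
    """Stop every rank as soon as any rank runs out of batches,
    so no rank is left hanging in a collective."""
    it = iter(train_dl)
    while True:
        batch = next(it, None)
        # 1 if this rank still has a batch, 0 if its split is exhausted
        has_batch = torch.tensor(int(batch is not None), device=device)
        # MIN across ranks: if any rank is out of data, everyone breaks together
        dist.all_reduce(has_batch, op=dist.ReduceOp.MIN)
        if has_batch.item() == 0:
            break
        # --- placeholder training step; depends on your model and collate_fn ---
        loss = model(batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```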

2 Likes

When I use `.shuffle()` after `split_dataset_by_node`, I see the same examples on different ranks within the same epoch. I could not find any documentation for this scenario.

Update: I tried calling shuffle before `split_dataset_by_node` and got the same result: examples are repeated across ranks within the same epoch.

Update: The issue was that I was passing a rank-dependent seed to shuffle. If I pass the same seed to shuffle on all ranks, the problem goes away.
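
For reference, here is a minimal sketch of that fix: shuffle with the same seed on every rank, then split. The names `data_files`, `rank`, `world_size`, `batch_size`, `collate_fn` and `num_epochs` are placeholders from your own setup:

```python
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node
from torch.utils.data import DataLoader

dataset = load_dataset("parquet", data_files=data_files, streaming=True)

# Shuffle with the SAME seed on every rank, then split across ranks
train_ds = dataset["train"].shuffle(seed=42, buffer_size=1000)
train_ds = split_dataset_by_node(train_ds, rank=rank, world_size=world_size)

train_dl = DataLoader(train_ds, batch_size=batch_size, collate_fn=collate_fn)

for epoch in range(num_epochs):
    # reshuffle differently each epoch while keeping all ranks in sync
    train_ds.set_epoch(epoch)
    for batch in train_dl:
        ...
```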

3 Likes

Thanks for sharing. It helped me a lot.

Hi! When `dataset.num_shards % world_size == 0`, each node will get the same number of examples, right?

1 Like

> When `dataset.num_shards % world_size == 0`, each node will get the same number of examples, right?

In that case each node gets the same number of shards to load, which gives each node roughly the same amount of data. However, the shards most likely don't contain exactly the same number of examples, so the per-node splits can still differ slightly in length.
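
To check this on your side, you can print the shard count before splitting. A quick sketch, where `data_files` and `world_size` come from your own setup; for a streaming parquet dataset the shard count is the number of files, so you can make it divisible by `world_size` by adjusting `data_files`:

```python
from datasets import load_dataset

dataset = load_dataset("parquet", data_files=data_files, streaming=True)
train_ds = dataset["train"]

# number of shards (= number of parquet files here) and whether it divides evenly
print(train_ds.num_shards)
print(train_ds.num_shards % world_size == 0)
```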

2 Likes