How to use split_dataset_by_node and shuffle on an iterable dataset

I’m currently training a model on a single node with multiple GPUs, using an IterableDataset in streaming mode. However, after one epoch the training hangs and then crashes with a timeout. After extensive googling, I’ve come to believe the problem is the way I’m using split_dataset_by_node and shuffle: the GPUs seem to be getting a different number of batches.

Here’s my full dataloading/dataset code.

def load_train_objs(dataset_path):
    dataset_path_dict = {
        "train": train_files,
        "val": val_files,
    }

    dataset = load_dataset("parquet", data_files=dataset_path_dict, streaming=True)
    dataset = dataset.shuffle(seed=seed)
    ...
    return dataset

def prepare_data_loader(dataset, batch_size, rank, world_size):
    train_ds = dataset["train"]
    train_ds = split_dataset_by_node(train_ds, rank=rank, world_size=world_size)

    val_ds = dataset["val"]
    val_ds = split_dataset_by_node(val_ds, rank=rank, world_size=world_size)

    train_dl = DataLoader(
        train_ds,
        batch_size,
        num_workers=8,
        drop_last=True,
        collate_fn=collate_fn,
    )

    val_dl = DataLoader(val_ds, batch_size, drop_last=False, collate_fn=collate_fn)

    return train_dl, val_dl

I’m not sure at what point I need to call shuffle. Should it stay where it currently is, should it come after `train_ds = dataset["train"]`, or even after `train_ds = split_dataset_by_node(train_ds, rank=rank, world_size=world_size)`?

Any help would be greatly appreciated :slight_smile:

2 Likes

Hi! `split_dataset_by_node` may not return datasets of the same size, so your training may time out while the other ranks wait once one node runs out of examples.
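
One way to avoid the hang is to sync a per-step "still has data" flag across ranks, so every rank stops as soon as the first one runs out of batches. This is a minimal sketch (not necessarily what you need verbatim), assuming `torch.distributed` is already initialized and `model`, `optimizer`, `train_dl` and `device` come from your own setup:

```python
import torch
import torch.distributed as dist

def train_one_epoch(model, optimizer, train_dl, device):
    """Stop every rank as soon as any rank runs out of batches,
    so no rank is left hanging in a collective."""
    it = iter(train_dl)
    while True:
        batch = next(it, None)
        # 1 if this rank still has a batch, 0 if its split is exhausted
        has_batch = torch.tensor(int(batch is not None), device=device)
        # MIN across ranks: if any rank is out of data, everyone breaks together
        dist.all_reduce(has_batch, op=dist.ReduceOp.MIN)
        if has_batch.item() == 0:
            break
        # --- placeholder training step; depends on your model and collate_fn ---
        loss = model(batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```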

2 Likes

When I use `.shuffle()` after `split_dataset_by_node`, I see the same examples on different ranks within the same epoch. I could not find any documentation for this scenario.

Update: I tried calling shuffle before `split_dataset_by_node` and got the same result: examples are repeated across ranks within the same epoch.

Update: The issue was that I was passing a rank-dependent seed to shuffle. If I pass the same seed to shuffle on all ranks, the problem goes away.
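
For reference, here is a minimal sketch of that fix: shuffle with the same seed on every rank, then split. The names `data_files`, `rank`, `world_size`, `batch_size`, `collate_fn` and `num_epochs` are placeholders from your own setup:

```python
from datasets import load_dataset
from datasets.distributed import split_dataset_by_node
from torch.utils.data import DataLoader

dataset = load_dataset("parquet", data_files=data_files, streaming=True)

# Shuffle with the SAME seed on every rank, then split across ranks
train_ds = dataset["train"].shuffle(seed=42, buffer_size=1000)
train_ds = split_dataset_by_node(train_ds, rank=rank, world_size=world_size)

train_dl = DataLoader(train_ds, batch_size=batch_size, collate_fn=collate_fn)

for epoch in range(num_epochs):
    # reshuffle differently each epoch while keeping all ranks in sync
    train_ds.set_epoch(epoch)
    for batch in train_dl:
        ...
```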

3 Likes

Thanks for sharing. It helped me a lot.

Hi! When `dataset.num_shards % world_size == 0`, each node will get the same number of examples, right?

1 Like

> When `dataset.num_shards % world_size == 0`, each node will get the same number of examples, right?

In that case each node gets the same number of shards to load, which gives each node roughly the same amount of data. However, the shards most likely don't contain exactly the same number of examples, so the per-node splits can still differ slightly in length.
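
To check this on your side, you can print the shard count before splitting. A quick sketch, where `data_files` and `world_size` come from your own setup; for a streaming parquet dataset the shard count is the number of files, so you can make it divisible by `world_size` by adjusting `data_files`:

```python
from datasets import load_dataset

dataset = load_dataset("parquet", data_files=data_files, streaming=True)
train_ds = dataset["train"]

# number of shards (= number of parquet files here) and whether it divides evenly
print(train_ds.num_shards)
print(train_ds.num_shards % world_size == 0)
```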

2 Likes