How to use split_dataset_by_node and shuffle on iterable dataset

I’m currently training a model on a single node with multiple GPUs, using an IterableDataset in streaming mode. However, after one epoch it hangs and then crashes with a timeout. After a lot of googling I’ve come to believe the cause is the way I’m using split_dataset_by_node and shuffle: the GPUs seem to be getting different numbers of batches.

Here’s my full dataloading/dataset code.

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node
from torch.utils.data import DataLoader

def load_train_objs(dataset_path):
    # train_files / val_files are built from dataset_path (construction omitted)
    dataset_path_dict = {
        "train": train_files,
        "val": val_files,
    }

    # Stream the parquet files instead of downloading them, and shuffle with a fixed seed
    dataset = load_dataset("parquet", data_files=dataset_path_dict, streaming=True)
    dataset = dataset.shuffle(seed=seed)
    ...
    return dataset

def prepare_data_loader(dataset, batch_size, rank, world_size):
    # Give each rank its own part of the stream
    train_ds = dataset["train"]
    train_ds = split_dataset_by_node(train_ds, rank=rank, world_size=world_size)

    val_ds = dataset["val"]
    val_ds = split_dataset_by_node(val_ds, rank=rank, world_size=world_size)

    train_dl = DataLoader(
        train_ds,
        batch_size,
        num_workers=8,
        drop_last=True,
        collate_fn=collate_fn,
    )

    val_dl = DataLoader(val_ds, batch_size, drop_last=False, collate_fn=collate_fn)

    return train_dl, val_dl

I’m not sure at what point I need to call shuffle.
Should it be where it currently is, or should it come after
train_ds = dataset["train"]
or even after
train_ds = split_dataset_by_node(train_ds, rank=rank, world_size=world_size)

Any help would be greatly appreciated :slight_smile:

Hi! split_dataset_by_node may not return datasets of the same size, so your training can time out: once one node runs out of examples, the other nodes are left waiting for it.
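
If that is the issue, one generic workaround (just a minimal sketch, not part of the library: it assumes torch.distributed is already initialized, device is this rank’s GPU, and iterate_in_sync is a made-up helper name) is to make every rank stop as soon as any rank runs out of batches, so nobody is left hanging in a collective:

import torch
import torch.distributed as dist

def iterate_in_sync(dataloader, device):
    # Yield batches, but stop on every rank as soon as *any* rank runs out,
    # so no rank hangs in a collective waiting for the others to catch up.
    it = iter(dataloader)
    while True:
        try:
            batch = next(it)
            have_batch = torch.tensor(1, device=device)
        except StopIteration:
            batch = None
            have_batch = torch.tensor(0, device=device)
        # MIN across ranks: becomes 0 as soon as one rank is exhausted
        dist.all_reduce(have_batch, op=dist.ReduceOp.MIN)
        if have_batch.item() == 0:
            break
        yield batch

It can also help to check train_ds.n_shards before splitting: when the number of shards is divisible by world_size, split_dataset_by_node assigns each rank its own subset of shards instead of skipping examples, which usually keeps the per-rank sizes closer (though still not guaranteed identical, since shards can differ in length).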
