I’m currently training a model on a single node with multiple GPUs, using an `IterableDataset` in streaming mode. However, after one epoch training hangs and then crashes with a timeout. After extensive googling, I’ve come to believe the problem is how I’m combining `split_dataset_by_node` and `shuffle`: I suspect the GPUs end up with different numbers of batches, so one rank finishes its epoch early while the others wait.
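To make my suspicion concrete, here’s a minimal stdlib-only sketch (no `datasets` involved) of what I think is happening. If I understand `split_dataset_by_node` correctly, when the stream can’t be divided evenly across ranks it falls back to assigning examples round-robin by index, which leaves some ranks with more examples than others:

```python
# Hypothetical stand-in for round-robin splitting across ranks:
# each rank keeps the examples whose index maps to it.
def split_by_node(examples, rank, world_size):
    return [ex for i, ex in enumerate(examples) if i % world_size == rank]

examples = list(range(10))  # 10 examples, 3 "GPUs" -- not evenly divisible
world_size = 3
counts = [len(split_by_node(examples, r, world_size)) for r in range(world_size)]
print(counts)  # [4, 3, 3]
```

Rank 0 sees one more example than ranks 1 and 2, so its `DataLoader` yields one more batch; the other ranks then block in the next collective op until the timeout fires. At least, that’s my working theory.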
Here’s my full dataloading/dataset code.
```python
from datasets import load_dataset

def load_train_objs(train_files, val_files, seed):
    dataset_path_dict = {
        "train": train_files,
        "val": val_files,
    }
    dataset = load_dataset("parquet", data_files=dataset_path_dict, streaming=True)
    dataset = dataset.shuffle(seed=seed)
    ...
    return dataset
```
```python
from datasets.distributed import split_dataset_by_node
from torch.utils.data import DataLoader

def prepare_data_loader(dataset, batch_size, rank, world_size):
    train_ds = dataset["train"]
    train_ds = split_dataset_by_node(train_ds, rank=rank, world_size=world_size)
    val_ds = dataset["val"]
    val_ds = split_dataset_by_node(val_ds, rank=rank, world_size=world_size)
    train_dl = DataLoader(
        train_ds,
        batch_size=batch_size,
        num_workers=8,
        drop_last=True,
        collate_fn=collate_fn,  # collate_fn is defined elsewhere in my script
    )
    val_dl = DataLoader(val_ds, batch_size=batch_size, drop_last=False, collate_fn=collate_fn)
    return train_dl, val_dl
```
I’m not sure at what point I should call `shuffle`. Should it stay where it currently is, or should it come after

`train_ds = dataset["train"]`

or even after

`train_ds = split_dataset_by_node(train_ds, rank=rank, world_size=world_size)`?

Any help would be greatly appreciated.