I鈥檓 reading data using stream and I need to pass the data to a pipeline which is ran in distributed manner, where each processes is expected to handle different batch of data.
when I tried the following
dataset = load_dataset("oscar-corpus/OSCAR-2301",
token= token
language="ar",
streaming=True
split="train",
)
dataloader= iter(DataLoader(dataset, num_workers=5,batch_size = 1000,collate_fn = lambda x: [i for i in x]))
run_pipes(
inputs= dataloader, # any inputs of type Iterable
)
it didn鈥檛 work, the dataloader was replicated across processes, and processes ended up with the same batch of data.