My application relies on Dataset to manage some data.
We have multiple workers loading the same Dataset to do some computation.
Currently, a single worker writes to the Dataset and adds columns to the table.
I need the other workers to have access to the latest data.
In the past, we were able to overwrite the dataset, and the workers would simply reload the new version. But as of 1.16, we get an error saying that we can't overwrite a dataset.
What is the best way to do this? For now, I can disable caching to avoid the error.
Hi! Indeed, it's no longer possible to write to an opened dataset (doing so could corrupt its data).
Depending on your distributed setup, you can either pickle the dataset and send it from one worker to the others (pickling only transfers the paths to the local Arrow files to be reloaded, not the actual data), or save the dataset to a new location with save_to_disk and have the other workers reload it from there.
In a distributed setup, you can split a dataset by node using

```python
import os
from datasets.distributed import split_dataset_by_node

ds = split_dataset_by_node(ds, rank=int(os.environ["RANK"]), world_size=int(os.environ["WORLD_SIZE"]))
```
This can be used to train models in a distributed setup, with native support for the PyTorch DataLoader.
But if you wish to process the dataset, each node can write its data to a separate directory; the results can then be reloaded as datasets and concatenated later.