How to use Datasets in a distributed system?

Hello,

My application relies on Dataset to manage some data.
We have multiple workers loading the same Dataset to do some computation.

Currently, a single worker writes to the Dataset and adds columns to the table.
I need the other workers to have access to the latest data.

In the past, we were able to overwrite the dataset and the workers would simply reload the new version. But as of 1.16, we get an error saying that we can’t overwrite a dataset.

What is the best way to do this? For now, I can disable caching to avoid the error.
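
For reference, disabling caching globally looks roughly like this (recent versions of datasets expose disable_caching; older versions use set_caching_enabled(False) instead):

from datasets import disable_caching

# Turn off the on-disk cache globally, so transformed datasets are written
# to temporary files instead of the cache directory.
disable_caching()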

Hi! Indeed, it’s no longer possible to write to an open dataset (or you may corrupt its data).

Depending on your distributed setup, you can either pickle the dataset from one worker to another (pickling only serializes the paths to the local Arrow files to be reloaded, not the actual data), or save the dataset to a new location using save_to_disk and have the other workers reload it from there.
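
For the second option, a minimal sketch (the shared path is an assumption, and all workers must be able to reach the same filesystem):

from datasets import load_from_disk

# On the writer worker (ds is the dataset it has been updating):
# persist the updated dataset to a fresh location.
ds.save_to_disk("/shared/datasets/my_dataset_v2")

# On the reader workers: reload the dataset from that location.
ds = load_from_disk("/shared/datasets/my_dataset_v2")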

Thanks for the reply.

I proposed a solution in Ability to split a dataset in multiple files. · Issue #3544 · huggingface/datasets · GitHub

In brief, we would edit state.json to keep track of the new columns added as files. What do you think?

Note that this does not solve the issue when we update a value in the dataset.

I am currently using a versioning mechanism: every time I modify the dataset, a new version is written, and the workers load the latest one.

Open to suggestions :slight_smile:

Quick update :slight_smile:

In a distributed setup, you can split a dataset by node using:

import os
from datasets.distributed import split_dataset_by_node

# ds is an existing Dataset or IterableDataset; each rank keeps only its own shard
ds = split_dataset_by_node(ds, rank=int(os.environ["RANK"]), world_size=int(os.environ["WORLD_SIZE"]))

This can be used to train models in a distributed setup, with native support for the PyTorch DataLoader.
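
For example, the per-node shard can be fed straight to a DataLoader (a minimal sketch; the batch size and the torch formatting are assumptions):

from torch.utils.data import DataLoader

# Ask the dataset to return PyTorch tensors, then iterate over per-node batches.
ds = ds.with_format("torch")
dataloader = DataLoader(ds, batch_size=32)
for batch in dataloader:
    ...  # each rank only sees its own shard of the data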

But if you wish to process the dataset, each node can write its data to a separate directory; the per-node outputs can then be reloaded as datasets and concatenated later.
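
A minimal sketch of that pattern (directory names and environment variables are assumptions):

import os
from datasets import concatenate_datasets, load_from_disk

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# processed_shard is this node's processed portion of the dataset
# (e.g. the output of .map() on its shard); each node saves it separately.
processed_shard.save_to_disk(f"/shared/output/shard_{rank}")

# Later, reload every per-node shard and concatenate into one dataset.
shards = [load_from_disk(f"/shared/output/shard_{i}") for i in range(world_size)]
full_dataset = concatenate_datasets(shards)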

Hello!

Thanks for the update! In the end, we did the following:

  • Once a worker is done with a task, it writes a new version of the dataset.
  • When a worker gets a new job, it grabs the latest version.

This fixes our issue: if a worker is using an old version of the dataset, there is no problem, since we always write to a new version.

Now that my project is open source, I can better describe the solution.

  1. We save the dataset to a new version path with _get_new_version_path.
  2. We load the latest cache with load_cache (see the sketch below).
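
A rough sketch of how those two helpers can work (the real implementations in the project may differ; the versioned cache layout here is an assumption):

import os
import re

from datasets import load_from_disk

def _get_new_version_path(cache_dir: str) -> str:
    # Hypothetical layout: one sub-directory per version, named v0, v1, v2, ...
    versions = [int(d[1:]) for d in os.listdir(cache_dir) if re.fullmatch(r"v\d+", d)]
    return os.path.join(cache_dir, f"v{max(versions, default=-1) + 1}")

def load_cache(cache_dir: str):
    # Reload the most recent version of the dataset.
    latest = max(
        (d for d in os.listdir(cache_dir) if re.fullmatch(r"v\d+", d)),
        key=lambda d: int(d[1:]),
    )
    return load_from_disk(os.path.join(cache_dir, latest))

# Writer: ds.save_to_disk(_get_new_version_path("/shared/cache"))
# Readers: ds = load_cache("/shared/cache")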

While not optimal, it fixes our current problem.

Thank you!