How to use Datasets in a distributed system?

Hello,

My application relies on Dataset to manage some data.
We have multiple workers loading the same Dataset to do some computation.

Currently, a single worker writes to the Dataset and adds columns to the table.
I need the other workers to have access to the latest data.

In the past, we were able to overwrite the dataset and the workers would simply reload the new version. But as of 1.16, we get an error saying that we can’t overwrite a dataset.

What is the best way to do this? For now, I can disable caching to avoid the error.
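
For reference, disabling caching globally looks roughly like this (recent versions of datasets expose disable_caching; older versions use set_caching_enabled(False) instead):

from datasets import disable_caching

# Turn off the on-disk cache globally, so transformed datasets are written
# to temporary files instead of the cache directory.
disable_caching()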

Hi! Indeed, it’s no longer possible to write to an open dataset (or you may corrupt its data).

Depending on your distributed setup, you can either pickle the dataset from one worker to another (pickling only serializes the paths to the local Arrow files to be reloaded, not the actual data), or save the dataset to a new location using save_to_disk and have the other workers reload it from there.
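
For the second option, a minimal sketch (the shared path is an assumption, and all workers must be able to reach the same filesystem):

from datasets import load_from_disk

# On the writer worker (ds is the dataset it has been updating):
# persist the updated dataset to a fresh location.
ds.save_to_disk("/shared/datasets/my_dataset_v2")

# On the reader workers: reload the dataset from that location.
ds = load_from_disk("/shared/datasets/my_dataset_v2")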

Thanks for the reply.

I proposed a solution in Ability to split a dataset in multiple files. · Issue #3544 · huggingface/datasets · GitHub

In brief, we would edit state.json to keep track of the new columns added as files. What do you think?

Note that this does not solve the issue when we update a value in the dataset.

I am currently using a versioning mechanism: every time I modify the dataset, a new version is written, and the workers load the latest one.

Open to suggestions :slight_smile:

Quick update :slight_smile:

In a distributed setup, you can split a dataset by node using:

import os
from datasets.distributed import split_dataset_by_node

# ds is an existing Dataset or IterableDataset; each rank keeps only its own shard
ds = split_dataset_by_node(ds, rank=int(os.environ["RANK"]), world_size=int(os.environ["WORLD_SIZE"]))

This can be used to train models in a distributed setup, with native support for the PyTorch DataLoader.
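
For example, the per-node shard can be fed straight to a DataLoader (a minimal sketch; the batch size and the torch formatting are assumptions):

from torch.utils.data import DataLoader

# Ask the dataset to return PyTorch tensors, then iterate over per-node batches.
ds = ds.with_format("torch")
dataloader = DataLoader(ds, batch_size=32)
for batch in dataloader:
    ...  # each rank only sees its own shard of the data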

But if you wish to process the dataset, each node can write its data to a separate directory; the per-node outputs can then be reloaded as datasets and concatenated later.
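
A minimal sketch of that pattern (directory names and environment variables are assumptions):

import os
from datasets import concatenate_datasets, load_from_disk

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])

# processed_shard is this node's processed portion of the dataset
# (e.g. the output of .map() on its shard); each node saves it separately.
processed_shard.save_to_disk(f"/shared/output/shard_{rank}")

# Later, reload every per-node shard and concatenate into one dataset.
shards = [load_from_disk(f"/shared/output/shard_{i}") for i in range(world_size)]
full_dataset = concatenate_datasets(shards)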

Hello!

Thanks for the update! In the end, we did the following:

  • Once a worker is done with a task, it writes a new version of the dataset.
  • When a worker gets a new job, it grabs the latest version.

This fixes our issue: if a worker is using an old version of the dataset, there is no problem, since we always write to a new version.

Now that my project is open source, I can better describe the solution.

  1. We save the dataset to a new version path with _get_new_version_path.
  2. We load the latest cache with load_cache (see the sketch below).
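
A rough sketch of how those two helpers can work (the real implementations in the project may differ; the versioned cache layout here is an assumption):

import os
import re

from datasets import load_from_disk

def _get_new_version_path(cache_dir: str) -> str:
    # Hypothetical layout: one sub-directory per version, named v0, v1, v2, ...
    versions = [int(d[1:]) for d in os.listdir(cache_dir) if re.fullmatch(r"v\d+", d)]
    return os.path.join(cache_dir, f"v{max(versions, default=-1) + 1}")

def load_cache(cache_dir: str):
    # Reload the most recent version of the dataset.
    latest = max(
        (d for d in os.listdir(cache_dir) if re.fullmatch(r"v\d+", d)),
        key=lambda d: int(d[1:]),
    )
    return load_from_disk(os.path.join(cache_dir, latest))

# Writer: ds.save_to_disk(_get_new_version_path("/shared/cache"))
# Readers: ds = load_cache("/shared/cache")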

While not optimal, it fixes our current problem.

Thank you!