How to use Datasets in a distributed system?

Dref360 · January 13, 2022, 11:41pm

Thanks for the reply.

In brief, we would edit state.json to keep track of the new columns added as files. What do you think?

Note that this does not solve the issue when we update a value in the dataset.

For now, I save a new version of the dataset every time I modify it, and the workers load the latest version.
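
A minimal sketch of what such a versioning scheme could look like, assuming a single writer and a filesystem shared by all workers. The `v000001`-style directory layout and the helper names (`latest_version`, `publish`, `latest_path`) are hypothetical and only for illustration. `save_fn` stands in for whatever writes the dataset, for example `dataset.save_to_disk`, and workers would pass the returned path to `datasets.load_from_disk`.

```python
import os
import re
import tempfile
from pathlib import Path
from typing import Callable, Optional

# Hypothetical layout: <root>/v000001, <root>/v000002, ...
VERSION_RE = re.compile(r"^v(\d+)$")


def latest_version(root: Path) -> Optional[int]:
    """Return the highest published version number under root, or None."""
    if not root.exists():
        return None
    versions = [
        int(m.group(1))
        for p in root.iterdir()
        if p.is_dir() and (m := VERSION_RE.match(p.name))
    ]
    return max(versions, default=None)


def publish(root: Path, save_fn: Callable[[str], None]) -> Path:
    """Write a new version and make it visible to workers in one rename.

    Assumes a single writer; concurrent writers would need a lock.
    """
    root.mkdir(parents=True, exist_ok=True)
    current = latest_version(root)
    next_version = 1 if current is None else current + 1

    # Write into a temporary directory first. Its name does not match
    # VERSION_RE, so workers never see a half-written version.
    tmp = Path(tempfile.mkdtemp(prefix=".tmp-", dir=root))
    save_fn(str(tmp))  # e.g. dataset.save_to_disk

    final = root / f"v{next_version:06d}"
    os.rename(tmp, final)  # publish the finished version
    return final


def latest_path(root: Path) -> Optional[Path]:
    """Path a worker should load, or None if nothing is published yet."""
    version = latest_version(root)
    return None if version is None else root / f"v{version:06d}"
```

On the writer side this would be called as `publish(Path("/shared/my_dataset"), dataset.save_to_disk)`, and each worker would call `latest_path(Path("/shared/my_dataset"))` before loading. Old versions are never modified, so a worker that is still reading one is not affected by a new publish.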

I'm open to suggestions.
