How can I download a HuggingFace dataset with multiple threads?

I want to download a HuggingFace dataset, e.g. uonlp/CulturaX:

from datasets import load_dataset
ds = load_dataset("uonlp/CulturaX", "en")

However, it downloads on a single thread at about 50 MB/s, while my network supports 10 Gbps. Since this dataset is roughly 16 TB, I'd like to download it faster so I don't have to wait several days. How can I download a HuggingFace dataset with multiple threads?

You can use multiprocessing to parallelize the download and the conversion to Arrow by passing num_proc= to load_dataset.
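A minimal sketch of that suggestion, assuming the `datasets` library is installed; the `choose_num_proc` helper and the cap of 16 workers are my own illustrative choices, not part of the library:

```python
import os


def choose_num_proc(max_workers: int = 16) -> int:
    """Hypothetical helper: pick a worker count capped by available CPU cores."""
    return min(max_workers, os.cpu_count() or 1)


if __name__ == "__main__":
    from datasets import load_dataset  # pip install datasets

    # num_proc spawns multiple worker processes for the download and the
    # conversion to Arrow. CulturaX is ~16 TB, so this line is illustrative.
    ds = load_dataset("uonlp/CulturaX", "en", num_proc=choose_num_proc())
```

Note that `num_proc` uses processes rather than threads, so the speedup depends on how the dataset is split into files and on per-connection bandwidth limits on the remote side.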


It does not work for me — the download is still single-threaded. Why?