How can I download a HuggingFace dataset with multiple threads?

I want to download a HuggingFace dataset, e.g. uonlp/CulturaX:

from datasets import load_dataset
ds = load_dataset("uonlp/CulturaX", "en")

However, it downloads on a single thread at about 50 MB/s, while my network supports 10 Gbps. Since this dataset is roughly 16 TB, I'd like to download it faster so I don't have to wait several days. How can I download a HuggingFace dataset with multiple threads?

You can use multiprocessing to parallelize the download and the conversion to Arrow by passing num_proc= to load_dataset.
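A minimal sketch of that suggestion, assuming the `datasets` library is installed; the `choose_num_proc` helper and the cap of 16 workers are my own illustrative choices, not part of the library:

```python
import os


def choose_num_proc(max_workers: int = 16) -> int:
    """Hypothetical helper: pick a worker count capped by available CPU cores."""
    return min(max_workers, os.cpu_count() or 1)


if __name__ == "__main__":
    from datasets import load_dataset  # pip install datasets

    # num_proc spawns multiple worker processes for the download and the
    # conversion to Arrow. CulturaX is ~16 TB, so this line is illustrative.
    ds = load_dataset("uonlp/CulturaX", "en", num_proc=choose_num_proc())
```

Note that `num_proc` uses processes rather than threads, so the speedup depends on how the dataset is split into files and on per-connection bandwidth limits on the remote side.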


It does not work for me — the download is still single-threaded. Why?