Multiprocessing and sharding when creating dataset from scratch using loading script

Hi guys …

We are currently dealing with a huge number of images that definitely won't fit in the memory of our workstations. We wrote a couple of loading scripts following the tutorial from here and saw that it would take decades to generate the dataset using a single core … This brings us to a couple of questions. First, does the GeneratorBasedBuilder class support multiprocessing in a plug-and-play fashion (i.e. without us having to write the multiprocessing code ourselves)? Second, since the dataset is huge, we found that while generating the dataset with the load_dataset function it loads all the data into the computer's memory. Is there any way to keep it persistent on disk instead of in memory?

BR

Hey ! We’re adding multiprocessing here: Multiprocessed dataset builder [WIP] by TevenLeScao · Pull Request #5107 · huggingface/datasets · GitHub

In the next version of datasets you’ll be able to do

from datasets import load_dataset

num_proc = 16  # choose the number of processes to run in parallel
ds = load_dataset(..., num_proc=num_proc)

Each process is given a subset of shards to process - so just make sure your dataset is made of many shards :slight_smile:
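In practice, the shards are defined by the lists you pass in gen_kwargs in _split_generators: each element of such a list is one shard that a worker process can pick up. Here is a rough sketch of what a sharded image builder could look like (the directory layout and feature names are placeholders, not your actual data):

import os
import datasets

class MyImages(datasets.GeneratorBasedBuilder):

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"image": datasets.Image(), "filename": datasets.Value("string")}
            )
        )

    def _split_generators(self, dl_manager):
        # Placeholder layout: the images are already grouped into many directories.
        # The lists passed in gen_kwargs define the shards: each element is one
        # shard, and the shards are distributed across the num_proc processes.
        shard_dirs = [os.path.join("data", f"shard_{i}") for i in range(128)]
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"shard_dirs": shard_dirs},
            )
        ]

    def _generate_examples(self, shard_dirs):
        for shard_dir in shard_dirs:
            for filename in sorted(os.listdir(shard_dir)):
                path = os.path.join(shard_dir, filename)
                # the full path doubles as a unique key across all shards
                yield path, {"image": path, "filename": filename}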

You can already install datasets from this branch if you want to try it out.
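For example, something like this should work, assuming pip can resolve the PR ref directly (otherwise check out the branch linked in the PR):

pip install "git+https://github.com/huggingface/datasets.git@refs/pull/5107/head"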

Re: memory

datasets flushes the data to disk every 10,000 examples per process by default (in the next release this will be decreased to 1,000). So as long as one batch fits in memory you are fine. You can set a custom writer batch size via the DEFAULT_WRITER_BATCH_SIZE class attribute of your dataset builder; setting it to a lower value can be useful to reduce memory usage.
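For example, a minimal sketch (the value 500 is arbitrary; pick something that suits the size of your examples):

import datasets

class MyImages(datasets.GeneratorBasedBuilder):
    # flush examples to disk every 500 examples instead of the default,
    # which keeps peak memory low when single examples (e.g. large images) are big
    DEFAULT_WRITER_BATCH_SIZE = 500

    # _info, _split_generators and _generate_examples as usual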

Nice …

Let me know if I can contribute to any of this …