Multiprocessing and sharding when creating dataset from scratch using loading script

Hi guys …

We are currently dealing with a huge number of images that definitely won't fit in the memory of our workstations. We wrote a couple of loading scripts following the tutorial from here and saw that it would take decades to generate the dataset using a single core … This brings us to a couple of questions. First, does the GeneratorBasedBuilder class support multiprocessing in a plug-and-play fashion (i.e. without us having to write the multiprocessing code ourselves)? Second, since the dataset is huge, we found that while generating the dataset with the load_dataset function it loads all the data into the computer's memory. Is there any way to keep it persistent on disk instead of in memory?

BR

Hey ! We’re adding multiprocessing here: Multiprocessed dataset builder [WIP] by TevenLeScao · Pull Request #5107 · huggingface/datasets · GitHub

In the next version of datasets you’ll be able to do

from datasets import load_dataset

num_proc = 16  # choose the number of processes to run in parallel
ds = load_dataset(..., num_proc=num_proc)

Each process is given a subset of shards to process - so just make sure your dataset is made of many shards :slight_smile:
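In practice, the shards are defined by the lists you pass in gen_kwargs in _split_generators: each element of such a list is one shard that a worker process can pick up. Here is a rough sketch of what a sharded image builder could look like (the directory layout and feature names are placeholders, not your actual data):

import os
import datasets

class MyImages(datasets.GeneratorBasedBuilder):

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"image": datasets.Image(), "filename": datasets.Value("string")}
            )
        )

    def _split_generators(self, dl_manager):
        # Placeholder layout: the images are already grouped into many directories.
        # The lists passed in gen_kwargs define the shards: each element is one
        # shard, and the shards are distributed across the num_proc processes.
        shard_dirs = [os.path.join("data", f"shard_{i}") for i in range(128)]
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"shard_dirs": shard_dirs},
            )
        ]

    def _generate_examples(self, shard_dirs):
        for shard_dir in shard_dirs:
            for filename in sorted(os.listdir(shard_dir)):
                path = os.path.join(shard_dir, filename)
                # the full path doubles as a unique key across all shards
                yield path, {"image": path, "filename": filename}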

You can already install datasets from this branch if you want to try it out.
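For example, something like this should work, assuming pip can resolve the PR ref directly (otherwise check out the branch linked in the PR):

pip install "git+https://github.com/huggingface/datasets.git@refs/pull/5107/head"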

Re: memory

datasets flushes the data to disk every 10,000 examples per process by default (in the next release this will be decreased to 1,000). So as long as one batch fits in memory you are fine. You can set a custom writer batch size via the DEFAULT_WRITER_BATCH_SIZE class attribute of your dataset builder; setting it to a lower value can be useful to reduce memory usage.
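For example, a minimal sketch (the value 500 is arbitrary; pick something that suits the size of your examples):

import datasets

class MyImages(datasets.GeneratorBasedBuilder):
    # flush examples to disk every 500 examples instead of the default,
    # which keeps peak memory low when single examples (e.g. large images) are big
    DEFAULT_WRITER_BATCH_SIZE = 500

    # _info, _split_generators and _generate_examples as usual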

Nice …

Let me know if I can contribute to any of this …