We are currently dealing with a huge number of images which definitely won't fit in the memory of our workstations. We wrote a couple of loading scripts following the tutorial from here and saw that it would take decades to generate the dataset using a single core. This raises a couple of questions. First, does the GeneratorBasedBuilder class support multiprocessing in a plug-and-play fashion (i.e. without us having to write the multiprocessing ourselves)? Second, since the dataset is huge, we found that while generating the dataset with the load_dataset function all the data gets loaded into the computer's memory. Is there any way to keep it persistent on disk instead of in memory?
In the next version of datasets you’ll be able to do
```python
num_proc = 16  # choose the number of processes to run in parallel
ds = load_dataset(..., num_proc=num_proc)
```
Each process is given a subset of shards to process, so just make sure your dataset is made of many shards (see the sketch below).
You can already install datasets from this branch if you want to try it out.
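To get many shards, you can pass a list in gen_kwargs from _split_generators: as far as I understand, that list is what gets split across processes when num_proc > 1. Here is a minimal sketch of such a builder, assuming the images sit in a local folder; the class name, features, glob pattern, and label are placeholders, not part of the actual API.

```python
import glob

import datasets


class MyImageDataset(datasets.GeneratorBasedBuilder):
    """Placeholder image dataset builder: the file list defines the shards."""

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"image": datasets.Image(), "label": datasets.Value("string")}
            )
        )

    def _split_generators(self, dl_manager):
        # The list passed in gen_kwargs defines the shards: with num_proc > 1,
        # each process is handed a subset of these files to generate from.
        image_paths = sorted(glob.glob("/path/to/images/*.jpg"))  # placeholder path
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"image_paths": image_paths},
            )
        ]

    def _generate_examples(self, image_paths):
        for idx, path in enumerate(image_paths):
            yield idx, {"image": path, "label": "unknown"}  # placeholder label
```

With a builder like this, `load_dataset(..., num_proc=16)` would hand each of the 16 processes a subset of the image files to generate from.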
Re: memory
datasets flushes the data to disk every 10,000 items per process (in the next release this will be decreased to 1,000). So as long as each batch fits in memory you are fine. You can set a custom writer batch size by setting the DEFAULT_WRITER_BATCH_SIZE class attribute of your dataset builder; setting it to a lower value can be useful to reduce memory usage.
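For reference, here is a minimal sketch of where that attribute goes, reusing the same placeholder builder as above (the value 1,000 is just an example, pick it based on how large your decoded items are):

```python
import datasets


class MyImageDataset(datasets.GeneratorBasedBuilder):
    # The writer flushes a batch to disk once it reaches this many examples,
    # so only one batch per process has to fit in memory at a time.
    DEFAULT_WRITER_BATCH_SIZE = 1000  # placeholder value

    # _info, _split_generators and _generate_examples as in the sketch above
```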