Support for very large datasets?

What is the recommended way to use datasets when I have a large dataset (e.g. Common Crawl) and need distributed training? For example, is there built-in functionality to preprocess the data once and save/load it to disk in a binarized/efficient way? And is there anything worth noting for efficient distributed training with large datasets?

I tried going over the docs but didn’t find anything on this.
Thanks!

Hi! Sure, the datasets library is designed to support the processing of large-scale datasets. Datasets are loaded via memory mapping from your disk, so they don’t fill your RAM. You can parallelize your data processing using map, since it supports multiprocessing. Then you can save your processed dataset using save_to_disk, and reload it later using load_from_disk:

from datasets import load_dataset, load_from_disk

dataset = load_dataset(...)
dataset = dataset.map(..., num_proc=num_processes)
dataset.save_to_disk("path/to/save/directory")

# later
dataset = load_from_disk("path/to/save/directory")

Things worth noticing:

  • you can specify a cache_dir parameter in load_dataset so that you can store the raw + prepared data wherever you want, and delete it later to save space if needed.
  • If you are working on a cluster with a virtual filesystem, you may want to make sure that the memory mapping works efficiently; this is likely relevant if you are doing distributed training. There is a discussion about this here if that is your case. We are still investigating why some virtual filesystems behave this way.

Thanks for the great reply! This is very helpful.

Quick questions:

  1. Should I do data preprocessing before launching a distributed training job? In other words, would the ideal case be just having dataset = load_from_disk("path/to/save/directory") in distributed training script?

  2. Is there any built-in prefetch function like the one in e.g. fairseq? Or is this already taken care of?

  3. Out of curiosity, how was the shuffling done if not all data is loaded into RAM?

  1. Sure, this is the easiest way to load your processed dataset in your training script.
  2. When you load the dataset, the full dataset is memory-mapped from your disk. There’s no prefetch function: you can directly access any element at any position in your dataset.
  3. The shuffling is done by shuffling the index of the dataset (i.e. the mapping between what __getitem__ returns and the actual positions of the examples on disk). The actual elements on disk are not shuffled. Shuffling the index is done in memory though, and the resulting index is written to disk and loaded back via memory mapping afterwards.

Thanks!

For 2) would this be the bottleneck if there is no prefetch?

For 3) would this be fast enough? I guess it would be costly to get an arbitrary index from disk?

  1. The bottleneck is in general the I/O limitations of memory mapping, which depend on your hardware.
  2. Getting examples from arbitrary positions on disk is pretty fast; we use the Arrow format especially for that. It’s in general a matter of milliseconds. Shuffling the index, on the other hand, might be pretty slow if you have hundreds of billions of examples, since it does exact shuffling. This could be improved with an approximate shuffling method (though we currently don’t have one in the library)
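For reference, the kind of approximate shuffling alluded to above is often done with a fixed-size buffer (as in tf.data’s shuffle). This is not part of the datasets library; it’s just a generic sketch:

```python
import random

def buffered_shuffle(iterable, buffer_size, seed=0):
    """Approximate shuffle: keeps at most `buffer_size` items in memory.
    Each incoming item is swapped with a randomly chosen buffered one."""
    rng = random.Random(seed)
    buf = []
    for item in iterable:
        if len(buf) < buffer_size:
            buf.append(item)
        else:
            i = rng.randrange(buffer_size)
            yield buf[i]
            buf[i] = item
    rng.shuffle(buf)
    yield from buf

out = list(buffered_shuffle(range(10), buffer_size=4))
print(sorted(out) == list(range(10)))  # True: a permutation of the input
```

The result is only locally shuffled (items can move at most about buffer_size positions forward), which is the usual trade-off against exact shuffling.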

Hope that helps :slight_smile:


Thanks a lot, Quentin!


I assume you’ll be applying tokenizers or other processing to that large dataset using .map()? In that case, what batch_size and writer_batch_size did you use? I found it very difficult to find values that don’t consume all my RAM while staying fast (my dataset is >100 GB).

@lhoestq

Hi lhoestq! Thanks for explaining how to handle very large datasets.

I have another question about save_to_disk and load_from_disk.

My dataset has a lot of files (10,000 files) and its total size is larger than 5 TB.

The workflow is to preprocess each file and save the result with save_to_disk, one file at a time (otherwise it takes a long time to build the tables).

This results in 10,000 Arrow files (total size: 8 TB).

After this, I edit the state.json of one of the saved directories so that its "_data_files" points to all the Arrow files I made.

The problem is that load_from_disk then takes a lot of time (3 to 6 hours to reload).

Could I reduce the time it takes? Or could you share a better way to preprocess the files?

Hi! Did you try using load_from_disk on each one of the 10,000 Arrow files and then concatenate_datasets to get the full dataset?

Thanks for the suggestion! I figured out that this part takes a lot of time. So I just load each one of the files using load_from_disk and concatenate them as you said.

Thanks for diving into it :slight_smile:
The URL you shared seems to point to a dynamic code snippet (from the master branch) that got some changes a few days ago. For future reference, the “part that takes a lot of time” is here

Do you know by any chance which part of the Dataset initialization code exactly caused the slowdowns on your side?