Support for very large datasets?

What is the recommended way to use datasets when I have a large dataset (e.g. Common Crawl) and need distributed training? For example, is there built-in functionality to preprocess the data once and save/load it to disk in a binarized/efficient way? And is there anything worth noting for efficient distributed training with large datasets?

I tried going over the docs but didn’t find anything on this.
Thanks!

Hi! Sure, the datasets library is designed to support the processing of large-scale datasets. Datasets are loaded via memory mapping from your disk, so they don’t fill your RAM. You can parallelize your data processing using map, since it supports multiprocessing. Then you can save your processed dataset using save_to_disk, and reload it later using load_from_disk:

from datasets import load_dataset, load_from_disk

dataset = load_dataset(...)
dataset = dataset.map(..., num_proc=num_processes)
dataset.save_to_disk("path/to/save/directory")

# later
dataset = load_from_disk("path/to/save/directory")

Things worth noticing:

  • you can specify a cache_dir parameter in load_dataset so that you can store the raw + prepared data wherever you want, and delete it later to save space if needed.
  • If you are working on a cluster with a virtual filesystem, you may want to make sure that the memory mapping works efficiently; this is likely relevant if you are doing distributed training. There is a discussion about this here if that is your case. We are still investigating why some virtual filesystems behave this way.

Thanks for the great reply! This is very helpful.

Quick questions:

  1. Should I do data preprocessing before launching a distributed training job? In other words, would the ideal case be just having dataset = load_from_disk("path/to/save/directory") in distributed training script?

  2. Is there any built-in prefetch function like the one in e.g. fairseq? Or is this already taken care of?

  3. Out of curiosity, how was the shuffling done if not all data is loaded into RAM?

  1. Sure, this is the easiest way to load your processed dataset in your training script.
  2. When you load the dataset, the full dataset is memory-mapped from your disk. There’s no prefetch function: you can directly access any element at any position in your dataset.
  3. The shuffling is done by shuffling the index of the dataset (i.e. the mapping between what __getitem__ returns and the actual positions of the examples on disk). The actual elements on disk are not shuffled. Shuffling the index is done in memory though, and the resulting index is written to disk and loaded back via memory mapping afterwards.

Thanks!

For 2) would this be the bottleneck if there is no prefetch?

For 3) would this be fast enough? I guess it would be costly to get an arbitrary index from disk?

  1. The bottleneck is in general the I/O limitations of memory mapping, which depend on your hardware.
  2. Getting examples from arbitrary positions on disk is pretty fast; we use the Arrow format especially for that. It’s in general a matter of milliseconds. Shuffling the index, on the other hand, might be pretty slow if you have hundreds of billions of examples, since it does exact shuffling. This could be improved with an approximate shuffling method (though we currently don’t have one in the library)
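For reference, the kind of approximate shuffling alluded to above is often done with a fixed-size buffer (as in tf.data’s shuffle). This is not part of the datasets library; it’s just a generic sketch:

```python
import random

def buffered_shuffle(iterable, buffer_size, seed=0):
    """Approximate shuffle: keeps at most `buffer_size` items in memory.
    Each incoming item is swapped with a randomly chosen buffered one."""
    rng = random.Random(seed)
    buf = []
    for item in iterable:
        if len(buf) < buffer_size:
            buf.append(item)
        else:
            i = rng.randrange(buffer_size)
            yield buf[i]
            buf[i] = item
    rng.shuffle(buf)
    yield from buf

out = list(buffered_shuffle(range(10), buffer_size=4))
print(sorted(out) == list(range(10)))  # True: a permutation of the input
```

The result is only locally shuffled (items can move at most about buffer_size positions forward), which is the usual trade-off against exact shuffling.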

Hope that helps :slight_smile:


Thanks a lot, Quentin!


I assume you’ll be applying tokenizers or other processing to that large dataset using .map()? In that case, what batch_size and writer_batch_size did you use? I found it very difficult to find values that don’t consume all my RAM while staying fast (my dataset is >100 GB).

@lhoestq

Hi lhoestq! Thanks for explaining how to handle very large datasets.

I have another question about save_to_disk and load_from_disk.

My dataset has a lot of files (10,000 files) and its total size is larger than 5 TB.

The workflow is to preprocess each file and save the result with save_to_disk, one file at a time (otherwise it takes a long time to build the tables).

This results in 10,000 Arrow files (total size: 8 TB).

After this, I edit the state.json of one of the saved directories so that its "_data_files" points to all the Arrow files I made.

The problem is that load_from_disk then takes a lot of time (3 to 6 hours to reload).

Could I reduce the time it takes? Or could you share a better way to preprocess the files?

Hi! Did you try using load_from_disk on each one of the 10,000 Arrow files and then concatenate_datasets to get the full dataset?

Thanks for the suggestion! I figured out that this part takes a lot of time. So I just load each one of the files using load_from_disk and concatenate them as you said.

Thanks for diving into it :slight_smile:
The URL you shared seems to point to a dynamic code snippet (from the master branch) that got some changes a few days ago. For future reference, the “part that takes a lot of time” is here

Do you know by any chance which part of the Dataset initialization code exactly caused the slowdowns on your side?