Working with large datasets

vblagoje · November 3, 2020, 4:43pm

I am preparing datasets for BERT pre-training and often save_to_disk simply dies without saving the contents of the files to disk. The largest file I was able to save was 18 GB but above that, I am having no luck.

Any performance tips for dealing with large datasets? Should I simply shard before saving to disk? If I do that, then I get copies of 18 GB files in each shard’s directory. What are my options?

Forgot to mention: I am already using num_proc and larger batches to speed up dataset map invocations, and those work great. It’s save_to_disk that I am not sure how to deal with, along with how to shard without a copy of the underlying dataset ending up in every shard directory.

Thanks in advance.

lhoestq · November 3, 2020, 6:22pm

Hi!
Is there an error message?

vblagoje · November 3, 2020, 6:39pm

No, the Python process just dies. I have 120 GB of RAM on this machine and 500 GB of disk space. If I could shard this large dataset before saving, that would work too. I would just save the shards one by one. But when I shard, I get a copy of dataset.arrow in each shard directory, and it quickly adds up.

vblagoje · November 4, 2020, 12:59pm

I found a way to resolve the issue: shard the dataset before doing the last transformation and save_to_disk. That way the resulting shards do not each contain a copy of dataset.arrow plus an indices file. If you are curious to have a look, here it is: create_pretraining_dataset.py

lhoestq · November 9, 2020, 1:36pm

You had several copies of the full dataset file because the shard method only applies an indices mapping on top of the loaded dataset file: it doesn’t create a new dataset file. To remove the indices mapping and write one file per shard you have to call flatten_indices on each shard.

Do you have a way to reproduce the crash you experienced?

vblagoje · November 10, 2020, 9:54am

Gotcha, thanks! Perhaps flatten_indices could be a parameter when sharding? I haven’t seen the crash since I moved to this new approach. I am now trying to figure out how to create superfast dataloaders using datasets. If you have any tips on that, let me know.
