Working with large datasets

Hey @lhoestq,

I am preparing datasets for BERT pre-training, and save_to_disk often simply dies without writing the files to disk. The largest dataset I was able to save was 18 GB; above that, I am having no luck.

Any performance tips for dealing with large datasets? Should I simply shard before saving to disk? If I do that, then I get copies of 18 GB files in each shard’s directory. What are my options?

Forgot to mention: I am already using num_proc and larger batches to speed up dataset map invocations. Those work great. It’s the save_to_disk that I am not sure how to deal with, and how to shard without a copy of the underlying dataset ending up in every shard directory.

Thanks in advance.

Hi !
Is there an error message ?

No, the Python process just dies. I have 120 GB of RAM on this machine and 500 GB of disk space. If I could shard this large dataset before saving, that would work too. I would just save these shards one by one. But when I shard, I get a copy of dataset.arrow in each shard directory, and it quickly adds up.

I found a way to resolve the issue: shard the dataset before doing the last transformation and save_to_disk. That way the resulting shards are not copies of dataset.arrow plus an indices file. If you are curious to have a look, here it is: create_pretraining_dataset.py
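
A rough sketch of that ordering, for anyone reading along. This is not the linked script: transform_and_save_shards, transform, out_dir and the shard_00000 directory naming are hypothetical names made up for illustration. It assumes dataset is a datasets.Dataset and relies on map writing a new table that holds only the rows of the shard it was called on.

```python
import os


def transform_and_save_shards(dataset, transform, out_dir, num_shards, **map_kwargs):
    """Shard first, then run the last transformation, then save each shard.

    `dataset` is assumed to be a `datasets.Dataset`; `transform` is whatever
    function would normally be passed to `map` (hypothetical here).
    """
    paths = []
    for index in range(num_shards):
        # Select one shard of the not-yet-transformed dataset.
        shard = dataset.shard(num_shards=num_shards, index=index)
        # map() produces a new dataset containing only this shard's rows,
        # so the result no longer points at the full original file.
        processed = shard.map(transform, **map_kwargs)
        path = os.path.join(out_dir, f"shard_{index:05d}")
        processed.save_to_disk(path)
        paths.append(path)
    return paths
```

Extra keyword arguments are forwarded to map, so something like transform_and_save_shards(dataset, my_fn, "out", 8, batched=True, num_proc=4) should keep the num_proc and batching speed-ups (my_fn being your own function).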

You had several copies of the full dataset file because the shard method only applies an indices mapping on top of the loaded dataset file: it doesn’t create a new dataset file. To remove the indices mapping and write one file per shard you have to call flatten_indices on each shard.
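
Something along these lines should do it. Treat it as a minimal sketch rather than the official recipe: save_in_shards, out_dir and the shard_00000 directory naming are hypothetical names for illustration, and the exact behaviour of shard and save_to_disk may differ between versions of the library.

```python
import os


def save_in_shards(dataset, out_dir, num_shards):
    """Save `dataset` (assumed to be a `datasets.Dataset`) as independent shards."""
    paths = []
    for index in range(num_shards):
        # shard() on its own may only put an indices mapping on top of the full file.
        shard = dataset.shard(num_shards=num_shards, index=index, contiguous=True)
        # flatten_indices() removes the mapping and rewrites only this shard's rows.
        shard = shard.flatten_indices()
        path = os.path.join(out_dir, f"shard_{index:05d}")
        shard.save_to_disk(path)
        paths.append(path)
    return paths


def demo():
    """Toy round trip; the generated sentences stand in for a real corpus."""
    import tempfile

    from datasets import Dataset, concatenate_datasets, load_from_disk

    dataset = Dataset.from_dict({"text": [f"sentence {i}" for i in range(1000)]})
    with tempfile.TemporaryDirectory() as out_dir:
        paths = save_in_shards(dataset, out_dir, num_shards=4)
        # Each directory can be loaded on its own or concatenated back together.
        reloaded = concatenate_datasets([load_from_disk(p) for p in paths])
        assert len(reloaded) == len(dataset)
```

Each shard directory then contains only its own rows, so the total disk usage stays close to the size of the original dataset instead of growing with the number of shards.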

Do you have a way to reproduce the crash you experienced ?

Gotcha, thanks! Perhaps flatten_indices could be a parameter when sharding? I haven’t seen the crash again since I moved to this new approach. I am now trying to figure out how to create superfast dataloaders using datasets. If you have any tips on that one, lmk.
