I am preparing datasets for BERT pre-training, and save_to_disk often simply dies without writing the dataset files to disk. The largest dataset I managed to save was 18 GB; above that I have had no luck.
Any performance tips for dealing with large datasets? Should I simply shard before saving to disk? When I do that, I end up with a copy of the full 18 GB Arrow files in each shard’s directory. What are my options?
Forgot to mention: I am already using num_proc and larger batch sizes to speed up dataset map invocations, and those work great. It’s save_to_disk that I am not sure how to deal with, and how to shard without the full underlying dataset getting copied into every shard directory.
Thanks in advance.