What would be the recommended usage of
`datasets` given I have a large dataset, e.g. Common Crawl, and need distributed training? For example, is there built-in functionality that would let me preprocess the data once and save/load it to disk in a binarized/efficient format? And is there anything worth noting for efficient distributed training with large datasets?
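To make the first part concrete, here is roughly the preprocess-once workflow I have in mind (just a sketch; the tokenizer, file paths, `max_length`, and `num_proc` are placeholders I made up, not things I've settled on):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder tokenizer; in reality I'd use whatever matches my model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Fixed-length output so batches collate without a custom collator.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=512)

# Load raw text shards (stand-in for my Common Crawl-scale corpus).
raw = load_dataset("text", data_files={"train": "corpus/*.txt"}, split="train")

# Preprocess once, in parallel; results are materialized as Arrow files.
tokenized = raw.map(tokenize, batched=True, num_proc=8,
                    remove_columns=["text"])

# Persist the processed dataset so future runs can skip this step.
tokenized.save_to_disk("/data/cc_tokenized")
```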
I tried going over the docs but didn't find anything on either point.
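For the distributed part, this is the kind of setup I'm picturing, assuming PyTorch DDP launched with `torchrun` (again only a sketch; the batch size, epoch count, and column names are placeholders, and I haven't verified this is the intended pattern at this scale):

```python
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from datasets import load_from_disk

dist.init_process_group("nccl")  # one process per GPU via torchrun

# Arrow files are memory-mapped, so presumably each rank doesn't need
# to hold the whole dataset in RAM.
dataset = load_from_disk("/data/cc_tokenized")
dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])

# Give each rank a disjoint shard of the data.
sampler = DistributedSampler(dataset)
loader = DataLoader(dataset, batch_size=8, sampler=sampler)

num_epochs = 3  # placeholder
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # reshuffle shards between epochs
    for batch in loader:
        ...  # forward/backward with the DDP-wrapped model
```

Is this roughly the recommended approach, or is there something more efficient built into `datasets` for this?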