I want to load a big dataset that is sharded across .txt files; the total size is around 250GB.
If I just load it with load_dataset("text", data_files=[...]), will it load the complete dataset into memory, or will it use Arrow to memory-map the data and reduce the memory footprint of the dataset?
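Roughly, the call would look like this (the shard paths below are just placeholders for my actual files):

```python
from datasets import load_dataset

# Illustrative shard paths; in practice this is the real list of .txt shards (or a glob pattern).
data_files = [f"shards/part-{i:05d}.txt" for i in range(1000)]

dataset = load_dataset("text", data_files=data_files, split="train")
```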
And if I want to use the dataset for training, is it advisable to use Dataset directly or an IterableDataset?
Hi ! load_dataset() converts the .txt files to Arrow and memory-maps the Arrow file on disk, so yes, it doesn't fill your RAM.
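For example, you can check that the dataset is backed by Arrow files on disk rather than held in RAM (the glob pattern here is just for illustration):

```python
from datasets import load_dataset

ds = load_dataset("text", data_files="shards/*.txt", split="train")

# The rows live in Arrow files on disk and are memory-mapped, not loaded into RAM.
print(ds.cache_files)  # Arrow files backing the dataset
print(len(ds))         # number of text lines, read lazily from disk
```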
And if I want to use the dataset for training, is it advisable to use Dataset directly or an IterableDataset?
It depends on your dataset size: Dataset is perfectly fine for datasets under ~1TB. For bigger datasets, IterableDataset provides faster initialization (no Arrow conversion needed) and saves disk space.
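To get an IterableDataset you can pass streaming=True; the text files are then read on the fly instead of being converted to Arrow (again, the file pattern is just illustrative):

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset: no Arrow conversion, no cache written to disk.
ids = load_dataset("text", data_files="shards/*.txt", split="train", streaming=True)

for example in ids.take(3):  # iterate lazily over the first 3 lines
    print(example["text"])
```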