Big text dataset loading for training

I want to load a large dataset that is stored as shards of .txt files. The total dataset size is around 250 GB.

If I just load it using load_dataset("text", data_files=[]), will it load the complete dataset into memory, or will it use Arrow to memory-map the files and reduce the memory footprint of the dataset?
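
Roughly what I have in mind, as a sketch (the file paths below are placeholders, not my actual shard names):

```python
from datasets import load_dataset

# Sharded .txt files, ~250 GB in total (placeholder paths)
data_files = ["shards/part-00000.txt", "shards/part-00001.txt"]

# The "text" builder treats each line of the .txt files as one example
ds = load_dataset("text", data_files=data_files, split="train")
print(ds)
```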

And if I want to use the dataset for training, is it advisable to use Dataset directly, or IterableDataset?


If you have RAM to spare, it seems best to use Dataset; if not, it seems best to use IterableDataset with num_workers increased in the DataLoader.
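
As a rough sketch of both options (file paths, batch size, and worker count below are placeholders, not taken from your setup):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

data_files = ["shards/part-00000.txt", "shards/part-00001.txt"]  # placeholders

# Option 1: map-style Dataset. The text is converted to Arrow files in the
# cache and memory-mapped, so it does not have to fit in RAM, but the initial
# conversion costs time and disk space (roughly another copy of the data).
ds = load_dataset("text", data_files=data_files, split="train")
loader = DataLoader(ds, batch_size=8, num_workers=4)

# Option 2: streaming IterableDataset. Shards are read on the fly with no
# Arrow cache; increasing num_workers helps reading keep up with training.
ids = load_dataset("text", data_files=data_files, split="train", streaming=True)
stream_loader = DataLoader(ids, batch_size=8, num_workers=4)

for batch in stream_loader:
    # batch["text"] is a list of raw lines; tokenization would go here
    break
```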