Loading a big text dataset for training

I want to load a big dataset that is stored as shards of .txt files. The total dataset size is around 250 GB.

If I just load it with load_dataset("text", data_files=[...]), will it load the complete dataset into memory, or will it use Arrow to memory-map the data and reduce the memory footprint of the dataset?
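For reference, here is roughly how I am loading it right now (the shard paths below are just placeholders):

```python
from datasets import load_dataset

# Placeholder shard paths; the real data is ~250 GB of .txt shards.
data_files = [f"shards/part-{i:05d}.txt" for i in range(1000)]

ds = load_dataset("text", data_files=data_files, split="train")
print(ds)  # a Dataset with a single "text" column, one row per line
```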

And if I want to use the dataset for training, is it advisable to use Dataset directly, or IterableDataset?


If you have plenty of RAM to spare, it seems best to use Dataset; if not, it seems best to use IterableDataset with num_workers increased in the DataLoader.
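For example, a rough sketch of that setup (the file pattern, batch size, and worker count are placeholders; recent datasets versions split the shards across DataLoader workers):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# Stream the text shards instead of materializing them all at once.
ids = load_dataset("text", data_files="shards/*.txt", split="train", streaming=True)

# With num_workers > 0, shards are distributed across worker processes.
loader = DataLoader(ids, batch_size=32, num_workers=4)

for batch in loader:
    # batch["text"] is a list of raw lines; tokenize here or via ids.map(...)
    break
```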

Hi! load_dataset() converts the .txt files to Arrow and memory-maps the Arrow file on disk, so yes, it doesn't fill your RAM.
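For example, a small check of that (the cache path in the comment is illustrative):

```python
from datasets import load_dataset

ds = load_dataset("text", data_files="shards/*.txt", split="train")

# The rows live in Arrow files under the HF cache, not in RAM.
print(ds.cache_files)   # e.g. [{"filename": ".../text/default-.../text-train.arrow"}]
print(ds.num_rows)      # row count is read from the memory-mapped file
print(ds[0]["text"])    # random access only touches the pages it needs
```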

And if I want to use the dataset for training, is it advisable to use Dataset directly, or IterableDataset?

It depends on your dataset size: Dataset is perfectly fine for <1 TB. For bigger datasets, IterableDataset provides faster initialization (no Arrow conversion needed) and saves disk space.
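A minimal streaming sketch, assuming the same local .txt shards (buffer size and seed are placeholder values):

```python
from datasets import load_dataset

# streaming=True skips the Arrow conversion and returns an IterableDataset.
ids = load_dataset("text", data_files="shards/*.txt", split="train", streaming=True)

# Approximate shuffling with a fixed-size buffer, since a full shuffle
# would require reading the entire dataset first.
ids = ids.shuffle(buffer_size=10_000, seed=42)

for example in ids.take(3):
    print(example["text"])
```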
