Big text dataset loading for training

Hi! load_dataset() converts the .txt files to Arrow and memory-maps the Arrow file on disk, so no, it doesn't fill your RAM.

And if I want to use the dataset for training, is it advisable to use Dataset directly, or IterableDataset?

It depends on your dataset size. Dataset is perfectly fine for <1TB. For bigger datasets, IterableDataset provides faster initialization (no Arrow conversion needed) and saves disk space.
