Loading a big text dataset for training

I want to load a big dataset that is stored as shards of .txt files. The total dataset size is around 250 GB.

If I just load it with load_dataset("text", data_files=[...]), will it load the complete dataset into memory, or will it use Arrow to memory-map the data and reduce the memory footprint of the dataset?
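For reference, here is roughly how I am loading it right now (the shard paths below are just placeholders):

```python
from datasets import load_dataset

# Placeholder shard paths; the real data is ~250 GB of .txt shards.
data_files = [f"shards/part-{i:05d}.txt" for i in range(1000)]

ds = load_dataset("text", data_files=data_files, split="train")
print(ds)  # a Dataset with a single "text" column, one row per line
```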

And if I want to use the dataset for training, is it advisable to use Dataset directly, or IterableDataset?


If you have plenty of RAM to spare, it seems best to use Dataset; if not, it seems best to use IterableDataset with num_workers increased in the DataLoader.
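For example, a rough sketch of that setup (the file pattern, batch size, and worker count are placeholders; recent datasets versions split the shards across DataLoader workers):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# Stream the text shards instead of materializing them all at once.
ids = load_dataset("text", data_files="shards/*.txt", split="train", streaming=True)

# With num_workers > 0, shards are distributed across worker processes.
loader = DataLoader(ids, batch_size=32, num_workers=4)

for batch in loader:
    # batch["text"] is a list of raw lines; tokenize here or via ids.map(...)
    break
```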

Hi! load_dataset() converts the .txt files to Arrow and memory-maps the Arrow file on disk, so yes, it doesn't fill your RAM.
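For example, a small check of that (the cache path in the comment is illustrative):

```python
from datasets import load_dataset

ds = load_dataset("text", data_files="shards/*.txt", split="train")

# The rows live in Arrow files under the HF cache, not in RAM.
print(ds.cache_files)   # e.g. [{"filename": ".../text/default-.../text-train.arrow"}]
print(ds.num_rows)      # row count is read from the memory-mapped file
print(ds[0]["text"])    # random access only touches the pages it needs
```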

And if I want to use the dataset for training, is it advisable to use Dataset directly, or IterableDataset?

It depends on your dataset size: Dataset is perfectly fine for <1 TB. For bigger datasets, IterableDataset provides faster initialization (no Arrow conversion needed) and saves disk space.
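A minimal streaming sketch, assuming the same local .txt shards (buffer size and seed are placeholder values):

```python
from datasets import load_dataset

# streaming=True skips the Arrow conversion and returns an IterableDataset.
ids = load_dataset("text", data_files="shards/*.txt", split="train", streaming=True)

# Approximate shuffling with a fixed-size buffer, since a full shuffle
# would require reading the entire dataset first.
ids = ids.shuffle(buffer_size=10_000, seed=42)

for example in ids.take(3):
    print(example["text"])
```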
