I want to load a big dataset that is stored as shards of .txt files; the total size is around 250GB.
If I just load it with load_dataset("text", data_files=[...]), will it load the complete dataset into memory, or will it use Arrow memory mapping to reduce the dataset's memory footprint?
And if I want to use the dataset for training, is it advisable to use Dataset directly, or IterableDataset?
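For concreteness, this is the kind of call I mean (the shard paths are placeholders):

```python
from datasets import load_dataset

# Shard paths are placeholders; a glob pattern like "shards/*.txt" also works.
dataset = load_dataset(
    "text",
    data_files=["shards/part-000.txt", "shards/part-001.txt"],
    split="train",
)
```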
If you have RAM (or VRAM) to spare, it seems best to use Dataset; if not, it seems best to use IterableDataset with num_workers increased in the DataLoader.
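A rough sketch of both routes (the file pattern, batch size, and worker count are placeholders):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# Map-style route: the text is converted to Arrow files on disk and
# memory-mapped, so RAM usage stays small even for a ~250GB corpus.
ds = load_dataset("text", data_files="shards/*.txt", split="train")

# Streaming route: an IterableDataset that reads samples on the fly,
# without writing an Arrow cache to disk first.
stream = load_dataset(
    "text", data_files="shards/*.txt", split="train", streaming=True
)

# With a recent version of `datasets`, shards are distributed across
# DataLoader workers, so raising num_workers parallelizes the reads.
loader = DataLoader(stream, batch_size=32, num_workers=4)
```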
Hi, I have a ~1TB dataset stored on the HF Hub. I can download it to disk and read it successfully; however, it is large enough that it does not fit in my RAM.
What is the best practice for training a model on such a dataset?
I tried loading the dataset with load_dataset(..., streaming=True) and then keeping two buffers: one that the training process loads onto the GPU, and one that a separate thread fills by streaming from the dataset. Then, when the i…
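A minimal sketch of that double-buffering idea (the chunk size, buffer depth, and file paths are all made up):

```python
import queue
import threading

from datasets import load_dataset

# `stream` reads samples on the fly instead of caching them on disk.
stream = load_dataset(
    "text", data_files="shards/*.txt", split="train", streaming=True
)

buffer = queue.Queue(maxsize=2)  # at most two chunks in flight

def producer(dataset, chunk_size=1024):
    # Runs in a background thread: pulls samples from the stream and
    # hands them over to the training loop in fixed-size chunks.
    chunk = []
    for sample in dataset:
        chunk.append(sample)
        if len(chunk) == chunk_size:
            buffer.put(chunk)  # blocks while the consumer is behind
            chunk = []
    buffer.put(None)  # sentinel marking the end of the stream

threading.Thread(target=producer, args=(stream,), daemon=True).start()

while (chunk := buffer.get()) is not None:
    ...  # move `chunk` to the GPU and run the training step(s)
```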
From a linked transformers GitHub issue (opened 27 Apr 2022, closed 25 Jul 2023, labeled bug):
### System Info

I am using the transformers Trainer and running into the issue below.

The Trainer expects a torch.utils.data.Dataset as input, which loads the whole dataset into memory at once. Therefore, when the dataset is too large to load, there is nothing I can do except use an IterableDataset, which loads samples separately and results in low efficiency.

I wonder if there are any tricks like sharding in the Hugging Face Trainer.

Looking forward to your reply.
### Who can help?
Trainer: @sgugger
### Information
- [ ] The official example scripts
- [ ] My own modified scripts
### Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
### Reproduction
None
### Expected behavior
I wonder if there are any tricks like fairseq's "Sharding very large datasets" (https://fairseq.readthedocs.io/en/latest/getting_started.html).
Thanks a lot!
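For what it's worth, `datasets` exposes a sharding primitive that can support a fairseq-style loop. A hedged sketch (the shard count and paths are made up, and since Arrow files are memory-mapped this is rarely needed for memory reasons alone):

```python
from datasets import load_dataset

full = load_dataset("text", data_files="shards/*.txt", split="train")

num_shards = 10  # made-up shard count
for index in range(num_shards):
    # Dataset.shard() returns the index-th of num_shards pieces of the
    # dataset, so each pass trains on only 1/num_shards of the data.
    shard = full.shard(num_shards=num_shards, index=index)
    # e.g. re-create the Trainer with train_dataset=shard and resume
    # from the previous shard's checkpoint
```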