Best practices for a large dataset

Just to confirm: you have implemented a PyTorch IterableDataset that wraps two Hugging Face IterableDatasets, one for the +ve class items and one for the -ve class items, right?

I think using individual iterable datasets for the +ve and -ve classes can be a good idea, but the wrapper you put on top of them doesn't need to be an iterable dataset itself. The parent can be a normal (map-style) dataset. Have you tried this?

Or you could try implementing a single PyTorch IterableDataset that loads both the +ve and -ve classes itself.
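A minimal sketch of that single-dataset idea (the class name `PosNegDataset` and the stand-in generators are hypothetical; in practice the two streams would be your Hugging Face IterableDatasets):

```python
from torch.utils.data import IterableDataset


class PosNegDataset(IterableDataset):
    """Interleaves a positive and a negative stream 1:1."""

    def __init__(self, pos_iterable, neg_iterable):
        self.pos_iterable = pos_iterable
        self.neg_iterable = neg_iterable

    def __iter__(self):
        # Alternate one +ve and one -ve example; stops when the
        # shorter of the two streams is exhausted.
        for pos, neg in zip(iter(self.pos_iterable), iter(self.neg_iterable)):
            yield pos
            yield neg


# Stand-in streams for illustration only.
pos = ({"text": f"pos {i}", "label": 1} for i in range(3))
neg = ({"text": f"neg {i}", "label": 0} for i in range(3))

labels = [ex["label"] for ex in PosNegDataset(pos, neg)]
print(labels)  # alternating 1, 0, 1, 0, ...
```

This plugs straight into a `DataLoader`. One caveat to keep in mind: with `num_workers > 0`, each worker gets its own copy of an IterableDataset, so you'd need to shard the underlying streams per worker to avoid duplicated examples.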

I opened this issue: Big text dataset loading for training. Any insights you can share?
