Best practices for a large dataset

Just to confirm: you have implemented a PyTorch IterableDataset that wraps two Hugging Face IterableDatasets, one for the +ve class items and one for the -ve class items, right?

I think using individual iterable datasets for the +ve and -ve classes can be a good idea, but the wrapper you put on top of them doesn't need to be an iterable dataset itself. The parent can be a normal (map-style) dataset. Have you tried this?

Or you could try implementing a single PyTorch IterableDataset that loads both the +ve and -ve classes itself.
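A minimal sketch of that single-dataset idea (the class name `PosNegDataset` and the stand-in generators are hypothetical; in practice the two streams would be your Hugging Face IterableDatasets):

```python
from torch.utils.data import IterableDataset


class PosNegDataset(IterableDataset):
    """Interleaves a positive and a negative stream 1:1."""

    def __init__(self, pos_iterable, neg_iterable):
        self.pos_iterable = pos_iterable
        self.neg_iterable = neg_iterable

    def __iter__(self):
        # Alternate one +ve and one -ve example; stops when the
        # shorter of the two streams is exhausted.
        for pos, neg in zip(iter(self.pos_iterable), iter(self.neg_iterable)):
            yield pos
            yield neg


# Stand-in streams for illustration only.
pos = ({"text": f"pos {i}", "label": 1} for i in range(3))
neg = ({"text": f"neg {i}", "label": 0} for i in range(3))

labels = [ex["label"] for ex in PosNegDataset(pos, neg)]
print(labels)  # alternating 1, 0, 1, 0, ...
```

This plugs straight into a `DataLoader`. One caveat to keep in mind: with `num_workers > 0`, each worker gets its own copy of an IterableDataset, so you'd need to shard the underlying streams per worker to avoid duplicated examples.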

I opened this issue: Big text dataset loading for training. Any insights you can share?
