Running out of memory processing dataset

see: Stream (the Datasets docs page on streaming)
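
For example, a minimal streaming setup might look like this (the dataset name is a placeholder):

from datasets import load_dataset

# streaming=True yields an IterableDataset: examples are read lazily as you
# iterate, so the full dataset never has to fit in memory
ds = load_dataset("my_dataset", split="train", streaming=True)  # placeholder name

# shuffling a streaming dataset uses a fixed-size buffer instead of a
# full in-memory shuffle
ds = ds.shuffle(seed=42, buffer_size=1000)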

I had a similar issue where preprocessing the dataset would just fill up memory and OOM. I solved it by using an IterableDataset, but I got the feeling that wasn't the intended approach. My impression is that preprocessing (specifically the .map() function) is meant for small, non-memory-intensive operations like tokenization, not for loading large data like images into memory. I remember doing something hacky like keeping only the image paths in the dataset and using a transform/augmentation step to load each image on the fly. All of this is just my limited experience, though; .map() supposedly shouldn't OOM, since it doesn't load the whole dataset at once, and I got frustrated at some point and gave up on figuring out the right way.
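
For what it's worth, here is a rough sketch of that path-only trick using Datasets' set_transform(); the file paths and column names are hypothetical:

from datasets import Dataset
from PIL import Image

# keep only lightweight string paths in the dataset itself
ds = Dataset.from_dict({"path": ["img/0001.jpg", "img/0002.jpg"]})  # hypothetical paths

def load_images(batch):
    # runs on the fly each time rows are accessed, so decoded images
    # are never materialized in the dataset's cache
    batch["image"] = [Image.open(p).convert("RGB") for p in batch["path"]]
    return batch

ds.set_transform(load_images)  # unlike .map(), nothing is precomputed up front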

You can shuffle an iterable/streaming dataset (see the link above), and you can also use a Trainer callback to trigger a reshuffle at the start of each epoch. I haven't tested the code below, but something like this should work:

from datasets import IterableDataset
from transformers import TrainerCallback

class ShuffleCallback(TrainerCallback):
    def on_epoch_begin(self, args, state, control, train_dataloader, **kwargs):
        # bump the streaming dataset's epoch so its shuffle buffer reshuffles
        if isinstance(train_dataloader.dataset, IterableDataset):
            train_dataloader.dataset.set_epoch(train_dataloader.dataset._epoch + 1)

and

trainer_object = Trainer(
    ...
    callbacks=[ShuffleCallback()],
)
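
(If I understand it correctly, set_epoch() works because a streaming dataset's shuffle buffer is seeded with seed + epoch, so bumping the epoch gives a different order on each pass.)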