Running out of memory processing dataset

jbmaxwell · March 27, 2023, 1:22pm

I’m using train_text_to_image.py to train on a fairly large dataset of 920k images. During the preprocessing of the dataset everything seems okay until I get to “Extracting data files”, at which point it hangs:

Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 920251/920251 [00:02<00:00, 343404.14it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 920251/920251 [00:02<00:00, 339180.98it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 920251/920251 [00:04<00:00, 205028.97it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 920251/920251 [00:02<00:00, 348870.18it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 920251/920251 [00:02<00:00, 364106.49it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 920251/920251 [00:02<00:00, 374222.79it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 920251/920251 [00:04<00:00, 219925.98it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 920251/920251 [00:07<00:00, 127850.77it/s]
Downloading and preparing dataset imagefolder/default to /drive/storage/data/hf_cache/imagefolder/default-6ce0ff74ab1e4790/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f…
Downloading data files: 100%|████████████████████████████████████████████████████████████████████████████████████████| 920251/920251 [00:10<00:00, 84143.52it/s]
Downloading data files: 0it [00:00, ?it/s]
Extracting data files: 0it [00:00, ?it/s]

If I monitor the process in top I can see that, when it gets to this point, a single python process runs at 100% with continuously increasing resident memory until it hits the system’s maximum. It will stay there for a while but eventually quit (possibly running out of swap, but I’m not sure). Earlier stages of processing run on 8 processes.

I’m running using Accelerate on an 8x A100 80GB Google compute node with 1.3TB main memory. My initial batch size was 8 but I’ve also tried with 1 (I don’t think this has anything to do with batches, but I thought I’d give it a try).

Any help greatly appreciated as I’m kind of stuck here!

jbmaxwell · March 27, 2023, 4:46pm

I see that there’s a “streaming” option to load_dataset. Would it make sense to set that to True? And will that prevent shuffling the data?

ozanciga · March 27, 2023, 5:03pm

see: Stream

i had a similar issue where preprocessing of the dataset would just fill up the memory and oom’d. i solved it by using IterableDataset but i got the feeling it wasn’t desirable. i feel like preprocessing (specifically .map() fn) is for small, not memory-intensive operations like tokenization, and not for loading up large datasets into memory like images. i remember doing something hacky like using transform/augmentation to load an image on the fly while only storing the path but all of this is my limited experience, because .map() supposedly shouldn’t lead to oom because it doesn’t load dataset all at once. i just got frustrated at some point and decided to not figure out the right way.
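A minimal sketch of the path-only approach described above, written in plain Python so it runs without any libraries. The class name `LazyImageRecords` and its `loader` parameter are hypothetical, not part of any library. In 🤗 Datasets the comparable hook would be `Dataset.set_transform` / `Dataset.with_transform`, which apply a function when rows are accessed rather than up front.

```python
from pathlib import Path


class LazyImageRecords:
    """Hold only file paths in memory; read image data when a record is accessed.

    Hypothetical illustration: `loader` is any callable that turns a Path
    into image data (raw bytes by default).
    """

    def __init__(self, paths, loader=None):
        self.paths = [Path(p) for p in paths]
        # Default loader reads the raw file bytes at access time.
        self.loader = loader or (lambda p: p.read_bytes())

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        path = self.paths[idx]
        # Nothing is cached, so memory use stays proportional to one record.
        return {"path": str(path), "image": self.loader(path)}
```

Only the list of paths lives in RAM for the whole run, so memory should grow with the number of files rather than with the total size of the images.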

you can shuffle an iterable/streaming dataset, see above link, also look into trainer callbacks where you can invoke a reshuffle after each epoch. i haven’t tested below code but something like this should work:

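from torch.utils.data import IterableDataset
from transformers import TrainerCallback

class ShuffleCallback(TrainerCallback):
    # Bump the dataset's epoch so a shuffled streaming dataset is reshuffled each epoch (untested).
    def on_epoch_begin(self, args, state, control, train_dataloader, **kwargs):
        if isinstance(train_dataloader.dataset, IterableDataset):
            train_dataloader.dataset.set_epoch(train_dataloader.dataset._epoch + 1)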

…

and

trainer_object = Trainer(
    ...
    callbacks=[ShuffleCallback()],
)
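For context on what shuffling a streaming dataset means: a stream cannot be shuffled globally, so the usual technique is an approximate shuffle through a fixed-size buffer. In 🤗 Datasets this is exposed as `IterableDataset.shuffle(seed=..., buffer_size=...)`. The function below is a plain-Python sketch of the buffer idea only, not the library's implementation; the name `buffered_shuffle` is made up for illustration.

```python
import random


def buffered_shuffle(iterable, buffer_size, seed=None):
    """Yield items from `iterable` in approximately shuffled order.

    Keeps at most `buffer_size` items in memory at any time.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in iterable:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered item and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer
```

A larger buffer gives a better shuffle at the cost of more memory. Passing a different seed per epoch gives a different order each time, which is the kind of per-epoch reshuffle the callback above is aiming for.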

jbmaxwell · March 27, 2023, 5:30pm

Okay, I haven’t tried IterableDataset, as that will take a bit more work, but just setting streaming=True in the load_dataset call doesn’t solve it. Still quits, just more quickly (which is kinda good, I guess).

Regarding IterableDataset, is there a relatively quick way of trying that within the context of train_text_to_image.py? It uses ImageFolder by default, and my data is prepared for that approach.

UPDATE: Hmm… Actually, I see from the comments that setting the streaming flag returns an IterableDataset… bummer. I’m stumped.

ozanciga · March 27, 2023, 5:43pm

are you still getting oom? with iterabledataset you shouldn’t get oom with a small enough batch size (try 2, although your setup should be able to handle much more).

jbmaxwell · March 27, 2023, 5:46pm

Well, the problem is it never returns from load_dataset, so I don’t actually get the IterableDataset. It seems to be some processing happening inside load_dataset that tops out the memory. I’m sure it’s clear, but this is not a CUDA out of memory error, this is system memory/RAM. I don’t get far enough for a CUDA memory error.

PS - I tried a smaller batch size earlier, but I can give it another shot. But I do think the problem is prior to any kind of batching.

ozanciga · March 27, 2023, 6:08pm

can you try this? load_dataset(..., writer_batch_size=1)

jbmaxwell · March 27, 2023, 6:56pm

It hasn’t actually crashed yet, but it will soon:

top - 18:54:25 up 13:16,  3 users,  load average: 1.01, 1.12, 2.51
Tasks: 758 total,   2 running, 756 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.6 us,  1.2 sy,  0.0 ni, 98.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 1370757.+total,  96039.5 free, 1271344.+used,   3373.6 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  92884.0 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                   
  69994 james     20   0 1348.8g   1.2t 246756 R 100.0  88.9  27:44.66 python

In case it’s useful:

torch: 1.13.1+cu117
diffusers: 0.15.0.dev0

jbmaxwell · March 27, 2023, 7:37pm

Maybe I’ll grab a local diffusers copy and make an editable install and see if I can figure out exactly where in load_dataset things are going awry.
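One way to narrow down where memory grows, sketched with the standard-library `tracemalloc` module. The helper name `top_allocations` is hypothetical. An important caveat: `tracemalloc` only sees allocations made through Python's allocator, so memory held by native code (for example Arrow buffers) would not show up here. It is best tried on a small subset of the data.

```python
import tracemalloc


def top_allocations(fn, limit=5):
    """Run `fn()` and report the source lines that allocated the most memory.

    Returns (result_of_fn, [(location, size_in_bytes), ...]), largest first.
    """
    tracemalloc.start()
    try:
        # Keep the result alive so its memory is still allocated at snapshot time.
        result = fn()
        snapshot = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # statistics("lineno") groups allocations by file and line, biggest first.
    stats = snapshot.statistics("lineno")[:limit]
    return result, [(str(stat.traceback), stat.size) for stat in stats]
```

For example, `fn` could be a lambda that wraps the `load_dataset` call on a few thousand files; the reported locations point at the Python lines responsible for the largest live allocations.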

pcuenq · March 29, 2023, 9:45am

Hi @jbmaxwell! Is it a public dataset so we can try to replicate? If it is not, can you try to replicate with another dataset and see if you still have the same issue? Also copying @lhoestq in case I’m missing something obvious here.

jbmaxwell · March 29, 2023, 2:31pm

It’s a private dataset, but it turned out there were missing column names. Strange that it wasn’t easier to catch (and that the result was this memory leak problem), but once I found it everything went as expected. I had been trying to trace the problem from load_dataset before I realized what was wrong, but I never found the source of the memory issue.
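Since the root cause was missing column names, a quick check of the metadata header before calling load_dataset may catch this kind of problem early. The sketch below assumes an ImageFolder-style `metadata.csv`, where `file_name` is the column ImageFolder uses to link rows to images. The `text` column is an assumption based on the training script's default caption column; the thread does not say which columns were actually missing. The function name is hypothetical.

```python
import csv


def missing_metadata_columns(lines, required=("file_name", "text")):
    """Return the required column names absent from a CSV header.

    `lines` is any iterable of text lines, such as an open file object.
    `required` is an assumption; adjust it to the columns your script expects.
    """
    reader = csv.reader(lines)
    # The first row is treated as the header; an empty file has no header.
    header = [name.strip() for name in next(reader, [])]
    return [column for column in required if column not in header]
```

Usage would look like `with open("metadata.csv", newline="") as f: print(missing_metadata_columns(f))`, where an empty list means every expected column name is present in the header.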
