I'm using `train_text_to_image.py` to train on a fairly large dataset of 920k images. Preprocessing of the dataset seems fine until it reaches "Extracting data files", at which point it hangs:
```
Resolving data files: 100%|██████████| 920251/920251 [00:02<00:00, 343404.14it/s]
Resolving data files: 100%|██████████| 920251/920251 [00:02<00:00, 339180.98it/s]
Resolving data files: 100%|██████████| 920251/920251 [00:04<00:00, 205028.97it/s]
Resolving data files: 100%|██████████| 920251/920251 [00:02<00:00, 348870.18it/s]
Resolving data files: 100%|██████████| 920251/920251 [00:02<00:00, 364106.49it/s]
Resolving data files: 100%|██████████| 920251/920251 [00:02<00:00, 374222.79it/s]
Resolving data files: 100%|██████████| 920251/920251 [00:04<00:00, 219925.98it/s]
Resolving data files: 100%|██████████| 920251/920251 [00:07<00:00, 127850.77it/s]
Downloading and preparing dataset imagefolder/default to /drive/storage/data/hf_cache/imagefolder/default-6ce0ff74ab1e4790/0.0.0/37fbb85cc714a338bea574ac6c7d0b5be5aff46c1862c1989b20e0771199e93f…
Downloading data files: 100%|██████████| 920251/920251 [00:10<00:00, 84143.52it/s]
Downloading data files: 0it [00:00, ?it/s]
Extracting data files: 0it [00:00, ?it/s]
```
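In case it's useful, my understanding is that this corresponds to the `imagefolder` loading step inside the script; here's a minimal sketch of the equivalent call (the paths below are illustrative placeholders, not my exact ones):

```python
from datasets import load_dataset

# Minimal sketch of what I believe the script is doing for a local image
# folder; both paths below are illustrative placeholders.
dataset = load_dataset(
    "imagefolder",
    data_dir="/drive/storage/data/images",    # stand-in for my --train_data_dir
    cache_dir="/drive/storage/data/hf_cache", # matches the cache path in the log
)
```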
If I monitor the process in `top`, I can see that, once it reaches this point, a single Python process runs at 100% CPU with continuously growing resident memory until it hits the system's maximum. It sits there for a while but eventually dies (possibly after exhausting swap, though I'm not sure). The earlier preprocessing stages run on 8 processes.
I'm launching with Accelerate on an 8x A100 80GB Google Cloud compute node with 1.3TB of main memory. My initial batch size was 8, but I've also tried 1 (I don't think this has anything to do with batches, but I thought I'd rule it out).
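For completeness, this is roughly how I'm launching it (the model name, paths, and flags below are illustrative, not my exact command):

```bash
# Illustrative launch command; the model and paths are placeholders,
# not my exact values.
accelerate launch --multi_gpu --num_processes 8 train_text_to_image.py \
  --pretrained_model_name_or_path stabilityai/stable-diffusion-2-1 \
  --train_data_dir /drive/storage/data/images \
  --cache_dir /drive/storage/data/hf_cache \
  --train_batch_size 8
```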
Any help would be greatly appreciated, as I'm kind of stuck here!