PyTorch Dataset/DataLoader classes


My question is not directly related to HF libraries, so it is a bit off-topic, but I hope the moderators will not take too strict a view on that and let me keep it.

When training ResNet on the ImageNet dataset, I coded some data-loading functionality by hand, which was extremely useful to me. I am currently transitioning from TF2 to PyTorch and I am very new to the PyTorch Dataset and DataLoader classes. I am wondering whether these classes make the flow I coded by hand available out of the box. I did read the PyTorch tutorials and API docs before posting the question.

Here is the problem:
ImageNet consists of ~1.3 million JPEG images, which take about 140 GB of disk space. When compiling a batch, one needs to read batch_size image files from the disk, and each of them needs to be pre-processed. This pre-processing is computationally expensive (load an image file of size NxM, randomly choose an integer in the interval [256, 480], resize the image so that its shorter side is equal to this integer, crop randomly to extract a 224x224 square image, apply a random color transformation to it, etc.). If this pre-processing could be done once and then reused for all the epochs, it wouldn’t be a problem at all, but it needs to be redone for each file in each epoch (that’s how data augmentation is achieved). And training requires a large number of epochs (50-120).

Here is how I solved it:
I borrowed 30 GB from my RAM and made a buffer disk out of it. This space is enough to comfortably accommodate more than 1,000 already pre-processed batches. I created a text file that tracks the training progress (two lines: current epoch number, current batch number); the training process updates this file after every 200 batches (which is equivalent to 1% of an epoch with my batch size). Then I wrote a run-time pre-processing script (training and pre-processing run at the same time, in parallel), which checks:

  1. where the training process currently is
  2. where the buffer currently starts (at which batch number)
  3. where the buffer currently ends (at which batch number)

If the pre-processing script sees that the training process went ahead and it is now safe to delete some batches from the start of the buffer, then it does so to free space. If the pre-processing script sees that the training process has less than 800 batches pre-processed and waiting to be fed to the training process, then it jumps into action, pre-processes more batches and places them at the end of the queue. Then it waits. It checks every 100 seconds whether there is any work for it, does it if there is, and then waits again. Pre-processing takes place 100% on the CPU and I could use multiple threads. This is important. Without it the pre-processing would not be able to work fast enough, and the main GPU training process would have to wait (which is unacceptable).
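The hand-rolled flow above is essentially a bounded producer/consumer buffer. A minimal sketch of that idea, using only the Python standard library, might look like the following. The `preprocess`, `producer`, and `train` names are hypothetical placeholders, the in-memory queue stands in for the RAM disk and progress file, and a blocking queue replaces the 100-second polling; this is not the original script.

```python
import queue
import threading


def preprocess(batch_index):
    # Hypothetical placeholder for the expensive CPU-side augmentation.
    return [batch_index * 10 + i for i in range(4)]


def producer(buffer, num_batches):
    for batch_index in range(num_batches):
        # put() blocks while the buffer is full, so the producer never
        # runs more than `maxsize` batches ahead of the consumer.
        buffer.put(preprocess(batch_index))
    buffer.put(None)  # sentinel: no more batches


def train(num_batches, buffer_size=8):
    buffer = queue.Queue(maxsize=buffer_size)
    worker = threading.Thread(
        target=producer, args=(buffer, num_batches), daemon=True
    )
    worker.start()
    consumed = []
    while True:
        batch = buffer.get()  # waits only if the producer has fallen behind
        if batch is None:
            break
        consumed.append(batch)  # stand-in for one training step
    worker.join()
    return consumed
```

Taking a batch out of the queue frees its slot, which plays the role of deleting old batches from the start of the buffer. For CPU-heavy Python pre-processing, threads may be limited by the interpreter lock, so separate processes are usually the better fit.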

Can the PyTorch Dataset/DataLoader classes provide the above functionality out of the box? If yes, I would appreciate it if you could give me a push in the right direction. There is no need to even give me a code example (although that would be nice). Just tell me whether it is possible and where to look if it is.

Thank you!!


If there is anybody who is only 80-90% sure that there is no out of the box functionality like that, it would also help me a lot.

Hi ! I think you can use the PyTorch DataLoader:

  • with num_workers>0, you can use multiprocessing to load your data in parallel with the optimization steps
  • with the prefetch_factor parameter, you can say how many batches are loaded in advance by each data loading worker.
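As a rough illustration of those two parameters, here is a minimal sketch. `RandomCropDataset` and `make_loader` are hypothetical names, and the dataset synthesizes a random image instead of decoding a JPEG, so only the resize-and-crop part of the augmentation from the question is shown. Because the augmentation lives in `__getitem__`, it is redone on every access, so every epoch sees fresh random crops.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset


class RandomCropDataset(Dataset):
    """Hypothetical stand-in for ImageNet (synthetic images, no disk I/O)."""

    def __init__(self, num_images=64, crop=224):
        self.num_images = num_images
        self.crop = crop

    def __len__(self):
        return self.num_images

    def __getitem__(self, index):
        # Placeholder for decoding a JPEG: a random 3 x H x W image.
        image = torch.rand(3, 300, 400)
        _, h, w = image.shape
        # Random target length for the shorter side, in [256, 480].
        short = int(torch.randint(256, 481, (1,)))
        scale = short / min(h, w)
        new_h = max(short, round(h * scale))
        new_w = max(short, round(w * scale))
        image = F.interpolate(
            image[None], size=(new_h, new_w),
            mode="bilinear", align_corners=False,
        )[0]
        # Random crop of size crop x crop.
        top = int(torch.randint(0, new_h - self.crop + 1, (1,)))
        left = int(torch.randint(0, new_w - self.crop + 1, (1,)))
        image = image[:, top:top + self.crop, left:left + self.crop]
        label = index % 1000  # dummy label
        return image, label


def make_loader(dataset, batch_size=8, num_workers=4, prefetch_factor=2):
    kwargs = {}
    if num_workers > 0:
        # prefetch_factor only applies when worker processes are used.
        kwargs["prefetch_factor"] = prefetch_factor
    return DataLoader(
        dataset, batch_size=batch_size, shuffle=True,
        num_workers=num_workers, **kwargs,
    )
```

Iterating over `make_loader(RandomCropDataset())` yields image batches of shape (8, 3, 224, 224). With these settings, up to num_workers * prefetch_factor batches are prepared ahead of the training loop, which is the in-memory counterpart of the hand-made buffer. On platforms that start workers with spawn, the iteration should sit under an `if __name__ == "__main__":` guard.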

Let me know if that was what you were looking for :)


Yes! Thank you!