How to ensure GPU utilisation when preprocessing Hugging Face datasets


  • I am using PyTorch and have a Hugging Face dataset loaded via load_dataset.
  • The dataset has a train/test split and several features: ['image', 'spectrum', 'redshift', 'targetid'].
  • I would like to apply augmentations to the batches as they are fed to the PyTorch model. In scenario A, I work only with the training split and the 'image' feature, augmenting the images with e.g. transform = Compose([RandomHorizontalFlip(), RandomVerticalFlip(), CenterCrop(96)]) (I assume this also needs a ToTensor() to work with PyTorch). In scenario B, I work with the training split and multiple features, e.g. 'image' and 'spectrum', and apply different transforms to each feature.
  • Currently, my solution avoids Hugging Face's transform functionality; instead I do the augmentation inside my model using the on_after_batch_transfer hook (a PyTorch Lightning hook that is called once a batch has been transferred to the GPU). However, I would prefer the data augmentation to happen outside the model.
  • I can sketch out a solution to scenario A:
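from datasets import load_dataset
from torchvision.transforms import CenterCrop, Compose, RandomHorizontalFlip, RandomVerticalFlip, ToTensor

dataset = load_dataset('~/some_dir/')
train_images = dataset['train']['image']
image_transforms = Compose([ToTensor(), RandomHorizontalFlip(), RandomVerticalFlip(), CenterCrop(96)])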
  • I can sketch out a solution to scenario B:
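from datasets import load_dataset
from torchvision.transforms import CenterCrop, Compose, RandomHorizontalFlip, RandomVerticalFlip, ToTensor

dataset = load_dataset('~/some_dir/')
train_data = dataset['train'].select_columns(['image', 'spectrum'])  # keep only the two features of interest
image_transforms = Compose([ToTensor(), RandomHorizontalFlip(), RandomVerticalFlip(), CenterCrop(96)])
spectrum_transforms = Compose([ToTensor(), AddNoise()])  # AddNoise() is a custom transform I have defined elsewhere (not a standard torchvision transform)

def train_transform(examples):
    # examples is a batch: a dict mapping each feature name to a list of values
    image = [image_transforms(img) for img in examples['image']]
    spectrum = [spectrum_transforms(spec) for spec in examples['spectrum']]
    return {'image': image, 'spectrum': spectrum}

# Apply the transforms on the fly whenever examples are accessed
train_data.set_transform(train_transform)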
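  • For context, AddNoise is defined elsewhere in my code. Purely as an illustration, a hypothetical stand-in could look like the sketch below; the Gaussian noise model and the std parameter are assumptions for the example, and it expects a torch.Tensor as input.

```python
import torch


class AddNoise:
    """Hypothetical stand-in for the custom transform: adds Gaussian noise."""

    def __init__(self, std: float = 0.1):
        # Standard deviation of the noise (an assumed parameter name).
        self.std = std

    def __call__(self, spectrum: torch.Tensor) -> torch.Tensor:
        # randn_like keeps the dtype and device of the input tensor,
        # so the same transform works on CPU and GPU tensors.
        return spectrum + self.std * torch.randn_like(spectrum)


spec = torch.zeros(1000)
noisy = AddNoise(std=0.1)(spec)
```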



  • I am concerned that my proposed solutions are not the most efficient approach; I do not want data augmentation to be slower than necessary, since that would slow down model training. As far as I can tell, Hugging Face preprocessing runs on the CPU (slow) before the data is transferred to the model on the GPU. My current solution of using on_after_batch_transfer avoids this, because the preprocessing happens once the data is already on the GPU. Could you confirm whether my assumption that Hugging Face preprocessing happens on the CPU is correct and, if so, propose a way to do the preprocessing on the GPU?
  • I would also like to follow best practice for loading a Hugging Face dataset and preprocessing it efficiently (i.e. on the GPU) before the data is passed to the PyTorch model. If I am going about this in completely the wrong way, please disregard my proposed solutions and suggest the best-practice approach.
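  • To make the on-GPU idea concrete, here is a minimal sketch of the kind of batched augmentation that could run after the batch transfer (e.g. inside on_after_batch_transfer, or right after moving a batch to the device in a plain training loop). It is an illustration under assumptions, not a definitive implementation: augment_batch and its arguments are made-up names, it assumes image batches of shape (N, C, H, W), and the slicing-based crop only approximates CenterCrop(96).

```python
import torch


def augment_batch(images: torch.Tensor, crop: int = 96) -> torch.Tensor:
    """Randomly flip each image in an (N, C, H, W) batch, then center-crop.

    Runs on whatever device `images` already lives on (CPU or GPU).
    """
    n, _, h, w = images.shape
    # Draw an independent flip decision for every sample in the batch.
    flip_h = torch.rand(n, device=images.device) < 0.5
    flip_v = torch.rand(n, device=images.device) < 0.5
    # torch.where picks the flipped or the original image per sample.
    images = torch.where(flip_h[:, None, None, None], images.flip(-1), images)
    images = torch.where(flip_v[:, None, None, None], images.flip(-2), images)
    # Center crop by slicing the two spatial dimensions.
    top, left = (h - crop) // 2, (w - crop) // 2
    return images[:, :, top:top + crop, left:left + crop]


device = "cuda" if torch.cuda.is_available() else "cpu"
batch = torch.rand(8, 3, 128, 128, device=device)
out = augment_batch(batch)
```

    Because every operation acts on the whole batch tensor at once, there is no per-image Python loop, which is the part I would expect to be slow on the CPU.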