Data augmentation for images (ViT) using Hugging Face

Hi everyone,

I am currently training a ViT on a local dataset of mine. I used the Hugging Face dataset template to create my own dataset class.

To train my model I use PyTorch functions (Trainer etc…), and I would like to apply some data augmentation to my images.

Does Hugging Face support data augmentation for images? If not, and I should use PyTorch for the augmentation instead, how should I proceed?

Thank you


The feature extractors (like ViTFeatureExtractor) are fairly minimal and typically only support resizing images and normalizing the channels. For other kinds of image augmentation, you can use torchvision’s transforms or albumentations, for example.

Hi, thanks for the reply.

To be more specific, what is the best way to implement on-the-fly data augmentation during training with torchvision? My dataset is already very large, so I can’t create the augmented dataset up front and then load it; that would take far too much time and memory.

Is it feasible to do this in the generate_examples function of the dataset class?

If not, where would you advise me to do it?

If you could share a few snippets of code (just to give me an idea of the implementation you have in mind), I’d be more than happy.



You’re in luck, because we recently added an image classification script to the examples folder of the Transformers library. It illustrates how to use torchvision’s transforms (such as CenterCrop and RandomResizedCrop) on the fly in combination with Hugging Face Datasets, using the .set_transform() method.

Amazing!

Thanks a lot.

Just one question: I am using PyTorch Lightning for training. If I apply transforms.Compose inside preprocess_images (the function that basically does a moveaxis and applies feature_extractor, as you defined it here: Transformers-Tutorials/Fine_tuning_the_Vision_Transformer_on_CIFAR_10_with_PyTorch_Lightning.ipynb at master · NielsRogge/Transformers-Tutorials · GitHub), will the transformations be applied on the fly during training, as in your example (so that each epoch sees a different version of the same image)? Or does it create a fixed version of the dataset, with the augmentation performed only once at that point?

I ask because your example above uses the Hugging Face Trainer, so maybe it handles data augmentation differently from the PyTorch Lightning trainer in order to apply it on the fly.