Any best-practice examples of integrating a pretrained Hugging Face ViT into a PyTorch Lightning module?

There are a few examples (mostly notebooks) on how to run and fine-tune these models, but what I’m really looking for is a clean example of how to integrate, say, a pretrained ViT into my own pipeline that uses Lightning. For example: should I instantiate the AutoImageProcessor class inside my pl.LightningModule or in my pl.LightningDataModule? Should I implement my own forward method, or just call the forward of the pretrained model (which would be an attribute of my Lightning class)? I wish there were more examples like this instead of scripts and line-by-line usage.

Hi,

I do have a notebook on fine-tuning ViT with PyTorch Lightning here: Transformers-Tutorials/Fine_tuning_the_Vision_Transformer_on_CIFAR_10_with_PyTorch_Lightning.ipynb at master · NielsRogge/Transformers-Tutorials · GitHub

Oh that’s nice @nielsr! This is more like what I was looking for! I was playing with the semantic segmentation models. BTW, do you recommend using the built-in loss of the model (outputs.loss) during training (if the labels are passed in the input as well), or rather a loss I define myself (like you are doing in this notebook)?

It’s recommended to use the loss computed by the model itself. Any model with a task head in the Transformers library will calculate an appropriate loss if you provide labels to it.

Hi @nielsr,
How do I create a dataset if I’m using a custom dataset instead of a Hugging Face dataset?

You need to create a PyTorch dataset (and then a corresponding dataloader) as shown here. Each item of the dataset should return pixel_values and a label (both PyTorch tensors).
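
A rough sketch of what such a dataset could look like, assuming the images are already in memory as tensors. The class name CustomImageDataset, the optional transform argument, and the "labels" dictionary key are hypothetical choices, not something prescribed by the library; random tensors stand in for real images here.

```python
import torch
from torch.utils.data import Dataset, DataLoader


class CustomImageDataset(Dataset):
    """Hypothetical dataset wrapping in-memory images and integer labels."""

    def __init__(self, images, labels, transform=None):
        assert len(images) == len(labels), "need one label per image"
        self.images = images
        self.labels = labels
        self.transform = transform  # e.g. resizing / normalisation

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        if self.transform is not None:
            image = self.transform(image)
        # Both returned values are PyTorch tensors.
        pixel_values = torch.as_tensor(image, dtype=torch.float32)
        label = torch.tensor(self.labels[idx], dtype=torch.long)
        return {"pixel_values": pixel_values, "labels": label}


# Random tensors stand in for real 224x224 RGB images.
images = [torch.rand(3, 224, 224) for _ in range(8)]
labels = [i % 2 for i in range(8)]

dataset = CustomImageDataset(images, labels)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

# The default collate function stacks the per-item tensors into a batch.
batch = next(iter(loader))
```

In practice the transform would do the resizing and normalisation the model expects, and the images would usually be loaded from disk inside __getitem__ rather than held in memory.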