What is the best way to fine-tune ViT with a custom dataset?

I have checked out the course and I have come across tutorials for fine-tuning pre-trained models for NLP tasks.

But I would really like to use the Vision Transformer model for classifying images that I have. I have about 1.8k images belonging to 3 categories, and I would like to use ViT for classification. I want to fine-tune the model to my dataset and thus leverage transfer learning.

This is a single-label classification task.

How can I do this? What is the best way to fine-tune the pretrained ViT model for a classification task to a smaller dataset?

Can anyone point me towards any recipes, tutorials, or other how-tos?

Thanks.

Hi there! I made some demos on how to fine-tune ViT on a custom dataset here:
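
As a rough idea of what this involves, here is a minimal sketch using the Hugging Face `transformers` and `datasets` libraries. It is not taken from the demos. The folder layout (one sub-folder per class), the `google/vit-base-patch16-224-in21k` checkpoint, the hyperparameters, and the helper names `build_label_maps` and `fine_tune` are all assumptions chosen for illustration, so adjust them to your data.

```python
def build_label_maps(class_names):
    """Map class names to integer ids and back, keeping the given order."""
    label2id = {name: i for i, name in enumerate(class_names)}
    id2label = {i: name for name, i in label2id.items()}
    return id2label, label2id


def fine_tune(data_dir, output_dir="vit-finetuned",
              checkpoint="google/vit-base-patch16-224-in21k"):
    """Fine-tune a pretrained ViT on an image folder (illustrative sketch)."""
    # Heavy imports are local so build_label_maps works without them.
    import torch
    from datasets import load_dataset
    from transformers import (Trainer, TrainingArguments,
                              ViTForImageClassification, ViTImageProcessor)

    # Assumed layout: data_dir/<class_name>/<image files>
    dataset = load_dataset("imagefolder", data_dir=data_dir)["train"]
    dataset = dataset.train_test_split(test_size=0.2, seed=42)
    class_names = dataset["train"].features["label"].names
    id2label, label2id = build_label_maps(class_names)

    processor = ViTImageProcessor.from_pretrained(checkpoint)

    def transform(batch):
        # Resize and normalize PIL images into pixel_values tensors.
        images = [img.convert("RGB") for img in batch["image"]]
        inputs = processor(images, return_tensors="pt")
        inputs["labels"] = batch["label"]
        return inputs

    dataset = dataset.with_transform(transform)

    def collate(examples):
        # Stack individual examples into one batch for the model.
        return {
            "pixel_values": torch.stack([e["pixel_values"] for e in examples]),
            "labels": torch.tensor([e["labels"] for e in examples]),
        }

    # The classification head is newly initialized with one output per class.
    model = ViTForImageClassification.from_pretrained(
        checkpoint,
        num_labels=len(class_names),
        id2label=id2label,
        label2id=label2id,
    )

    args = TrainingArguments(
        output_dir=output_dir,
        per_device_train_batch_size=16,  # assumed; lower it if memory is tight
        num_train_epochs=4,              # assumed starting point
        learning_rate=2e-5,              # assumed starting point
        remove_unused_columns=False,     # keep "image" so transform can use it
    )
    trainer = Trainer(
        model=model,
        args=args,
        data_collator=collate,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
    )
    trainer.train()
    return trainer.evaluate()
```

Calling `fine_tune("path/to/images")` (a placeholder path) would train on 80% of the images and report the loss on the remaining 20%. With only about 1.8k images, starting from a pretrained checkpoint and using a small learning rate is a common choice, but the values above are starting points rather than tuned settings.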
