I have gone through the course and found tutorials on fine-tuning pre-trained models for NLP tasks.
However, I would like to use the Vision Transformer (ViT) model to classify my own images. I have about 1.8k images belonging to 3 categories, and I want to fine-tune a pretrained ViT on this dataset so that I can leverage transfer learning.
This is a single-label classification task.
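To make the setup concrete, here is a minimal sketch of how I picture the data side. Everything in it is an assumption for illustration: the class names are placeholders, the file paths are made up, I assume the images are spread evenly over the 3 categories, and the 80/20 stratified split is just one common choice rather than anything the course prescribes.

```python
import random
from collections import defaultdict

# Hypothetical class names; the real dataset simply has 3 categories.
CLASS_NAMES = ["class_a", "class_b", "class_c"]

# Single-label classification: each image maps to exactly one integer id.
label2id = {name: i for i, name in enumerate(CLASS_NAMES)}
id2label = {i: name for name, i in label2id.items()}


def stratified_split(samples, val_fraction=0.2, seed=0):
    """Split (path, label_id) pairs so every class keeps the same validation share."""
    by_label = defaultdict(list)
    for path, label in samples:
        by_label[label].append((path, label))

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, val = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_val = round(len(items) * val_fraction)
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val


# Placeholder file list: 600 made-up paths per class, 1800 in total.
samples = [
    (f"{name}/img_{i:04d}.jpg", label2id[name])
    for name in CLASS_NAMES
    for i in range(600)
]
train, val = stratified_split(samples)
```

With only about 1.8k images, keeping the class balance identical in the training and validation sets seemed like a sensible precaution to me, but I am open to other ways of splitting.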
How can I do this? What is the best way to fine-tune the pretrained ViT model for a classification task on a small dataset like this?
Can anyone point me to any recipes, tutorials, or other how-tos?
Thanks.