Equivalent for ignore token for Vision Transformers?

mohotmoz · May 12, 2022, 9:19am

Hello everyone,

A huge thank you, as always, for your help! I was wondering whether there is a way to ignore parts of an image when feeding it into a vision transformer, kind of like the “-100” label value that gets ignored for text transformers. Is there a simple way to do this with the Hugging Face models/Trainer?

My dataset consists of images as inputs, but the number of images per data point varies: anywhere from 1 image (most common) up to 5 images (slightly less common). My thinking was to tile them and leave blanks where tiles are missing; I just want the model to ignore those blank regions. I could also resize the images so that every pixel is always in use. Any other ideas? I don’t think it is worth going to something like a video transformer, since there are never more than 5 images per data point.
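
A minimal sketch of the tiling idea, to make the question concrete. It pastes up to 6 pre-resized images onto one canvas and also builds a per-patch boolean mask marking which patches fall in blank tiles. The function name `tile_with_patch_mask`, the 2x3 grid, the 112-pixel tiles, and the 16-pixel patch size are all assumptions for illustration, not anything from the library. How the mask would then be used (for example, the `bool_masked_pos` argument of `ViTModel.forward`, which is intended for masked image modeling) is exactly the open question here and is untested.

```python
import numpy as np


def tile_with_patch_mask(images, tile_size=112, grid=(2, 3), patch_size=16):
    """Tile images onto one canvas and mark patches that belong to blank tiles.

    images: list of (tile_size, tile_size, 3) uint8 arrays, already resized.
    Returns (canvas, mask), where mask[i, j] is True if patch (i, j) lies in
    an empty tile and should ideally be ignored by the model.
    """
    rows, cols = grid
    if len(images) > rows * cols:
        raise ValueError("more images than tiles in the grid")
    if tile_size % patch_size != 0:
        raise ValueError("tile_size must be a multiple of patch_size")

    patches_per_tile = tile_size // patch_size
    canvas = np.zeros((rows * tile_size, cols * tile_size, 3), dtype=np.uint8)
    # Start with every patch marked as blank, then clear the filled tiles.
    mask = np.ones((rows * patches_per_tile, cols * patches_per_tile), dtype=bool)

    for idx, img in enumerate(images):
        r, c = divmod(idx, cols)
        canvas[r * tile_size:(r + 1) * tile_size,
               c * tile_size:(c + 1) * tile_size] = img
        mask[r * patches_per_tile:(r + 1) * patches_per_tile,
             c * patches_per_tile:(c + 1) * patches_per_tile] = False

    return canvas, mask


# Example: a data point with only 2 of the 6 tiles filled.
example_images = [np.full((112, 112, 3), 255, dtype=np.uint8) for _ in range(2)]
canvas, mask = tile_with_patch_mask(example_images)
```

Note that with these assumed sizes the canvas is 224x336, so it would still need to match whatever input size the chosen model expects.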

Thank you again.
