Equivalent of the ignore token for Vision Transformers?

Hello everyone,

A huge thank you as always for your help! I was wondering if there is a way to make a vision transformer ignore parts of an input image - something like the -100 ignore index used for labels in text transformers. Is there a simple way to do this with the Hugging Face models/Trainer?

My dataset consists of images as inputs, but the number of images per data point varies: from 1 image (most common) up to 5 images (slightly less common), and everything in between. My thinking was to tile them onto a fixed canvas and leave some slots blank when fewer images are present - I just want the model to ignore those blank slots (a rough sketch of what I mean is below). Alternatively, I could resize the images so that every pixel is always in use… any other ideas? I don't think it's worth moving to something like a video transformer, since there are never more than 5 images per data point.
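Here's a minimal sketch of the tiling I had in mind, assuming 224x224 tiles laid out in a single 1x5 row and a patch-16 model like vit-base-patch16. The names `tile_images`, `TILE_SIZE`, `MAX_TILES`, and `PATCH` are my own, not anything from the transformers library:

```python
import torch

TILE_SIZE = 224  # side length of each tile in pixels
MAX_TILES = 5    # fixed number of slots per data point
PATCH = 16       # ViT patch size (assuming a patch-16 checkpoint)

def tile_images(images: list[torch.Tensor]) -> tuple[torch.Tensor, torch.Tensor]:
    """Pack 1-5 (C, 224, 224) tensors into one (C, 224, 1120) canvas.

    Also returns a boolean mask over the patch grid: True where a patch
    comes from a real image, False where it belongs to a blank slot.
    """
    c = images[0].shape[0]
    canvas = torch.zeros(c, TILE_SIZE, TILE_SIZE * MAX_TILES)
    for i, img in enumerate(images):
        canvas[:, :, i * TILE_SIZE:(i + 1) * TILE_SIZE] = img

    grid_h = TILE_SIZE // PATCH              # 14 patch rows
    grid_w = TILE_SIZE * MAX_TILES // PATCH  # 70 patch columns
    cols_per_tile = TILE_SIZE // PATCH       # 14 columns per tile
    patch_mask = torch.zeros(grid_h, grid_w, dtype=torch.bool)
    patch_mask[:, : len(images) * cols_per_tile] = True
    # Flatten row-major to match the order ViT's patch embedding
    # (Conv2d + flatten) unrolls patches in.
    return canvas, patch_mask.flatten()
```

Two caveats with this sketch: a 224x1120 canvas isn't the resolution pretrained checkpoints expect, so I'd presumably need `interpolate_pos_encoding=True` (or retrained position embeddings); and as far as I can tell `ViTModel.forward` doesn't accept an attention mask over patches, so the `patch_mask` would only help if I pool manually over the real patches or modify the attention myself - otherwise the blanks are just constant pixels the model has to learn to ignore.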

Thank you again.