Why does fine-tuning ViltForImagesAndTextClassification raise an "index out of range" error?

I am trying to fine-tune a ViltForImagesAndTextClassification model. Each example in my dataset has 10 images and 1 text input. Here is my model configuration:

ViltConfig {
  "_name_or_path": "dandelin/vilt-b32-mlm",
  "architectures": [
  "attention_probs_dropout_prob": 0.0,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2",
    "3": "LABEL_3",
    "4": "LABEL_4",
    "5": "LABEL_5",
    "6": "LABEL_6",
    "7": "LABEL_7",
    "8": "LABEL_8",
    "9": "LABEL_9"
  "image_size": 384,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2,
    "LABEL_3": 3,
    "LABEL_4": 4,
    "LABEL_5": 5,
    "LABEL_6": 6,
    "LABEL_7": 7,
    "LABEL_8": 8,
    "LABEL_9": 9
  "layer_norm_eps": 1e-12,
  "max_image_length": -1,
  "max_position_embeddings": 40,
  "modality_type_vocab_size": 2,
  "model_type": "vilt",
  "num_attention_heads": 12,
  "num_channels": 3,
  "num_hidden_layers": 12,
  "num_images": 10,
  "output_attentions": true,
  "output_hidden_states": true,
  "patch_size": 32,
  "qkv_bias": true,
  "tie_word_embeddings": false,
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "type_vocab_size": 2,
  "vocab_size": 30522

But during training I get this error:

IndexError                                Traceback (most recent call last)

<ipython-input-21-065dad2b736c> in <module>
     37       # encoding = base_processor(images, batch[1], return_tensors="pt")
---> 39       outputs = model(input_ids=batch['input_ids'], pixel_values=batch['pixel_values'])
     41       print(outputs)

8 frames

/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
   2208         # remove once script supports set_grad_enabled
   2209         _no_grad_embedding_renorm_(weight, input, max_norm, norm_type)
-> 2210     return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)

IndexError: index out of range in self

I used BertTokenizerFast for the text and ViltFeatureExtractor for the images.

I can’t figure out what is causing this issue.


Is it possible to provide a code snippet that reproduces your error?

In their documentation, they say that the token type index is 2 when passing 2 images. Usually it is 0 for text and 1 for an image. I switched the 2 to 1 and now it works. But I don’t know whether this will affect my model, as I have 10 images.
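For what it's worth, here is a minimal sketch of why the lookup could fail. It assumes the modality-type embedding table has `modality_type_vocab_size` rows (2 in the config above) and that image slot i is looked up with index i + 1. The `lookup` helper and the list-based `table` are hypothetical stand-ins, not the library code.

```python
# Hypothetical stand-in for an embedding table: one row per modality type.
MODALITY_TYPE_VOCAB_SIZE = 2  # value from the posted config
NUM_IMAGES = 10               # value from the posted config

# Tiny hidden size, for illustration only.
table = [[0.0] * 4 for _ in range(MODALITY_TYPE_VOCAB_SIZE)]


def lookup(index):
    """Return the row for a modality type, failing like an embedding lookup would."""
    if not 0 <= index < len(table):
        raise IndexError("index out of range in self")
    return table[index]


failing = []
for i in range(NUM_IMAGES):
    try:
        lookup(i + 1)  # assumption: image slot i uses type index i + 1
    except IndexError:
        failing.append(i + 1)

print(failing)  # every index that does not fit in a 2-row table
```

Under these assumptions only the first image (index 1) fits, and the second image already triggers the error.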

Note that ViltForImagesAndTextClassification is mainly meant for tasks like NLVR2, where the model is given 2 images and a text and has to determine whether the text is true or false.

Is this a use case you’d like to solve?

I want to use this model for my task, where I have 10 images instead of 2. I think I am getting the error because of the following snippet:

outputs = self.vilt(
    # other arguments omitted
    pixel_values=pixel_values[:, i, :, :, :] if pixel_values is not None else None,
    pixel_mask=pixel_mask[:, i, :, :] if pixel_mask is not None else None,
    image_embeds=image_embeds[:, i, :, :] if image_embeds is not None else None,
    image_token_type_idx=i + 1,  # the argument I changed
)

Earlier it was image_token_type_idx=i+1. When I changed this to image_token_type_idx=1, it started working.
As I have 10 images, should I keep it as it was before?
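To compare the two options, here is a rough sketch of how many rows the modality-type table would need in each case. It assumes text uses index 0 and that the table size comes from `modality_type_vocab_size` in the config. The helper names `image_type_indices` and `min_vocab_size` are made up for illustration.

```python
def image_type_indices(num_images, distinct=True):
    """Type index per image slot: i + 1 if distinct, otherwise always 1."""
    return [i + 1 if distinct else 1 for i in range(num_images)]


def min_vocab_size(indices):
    """Smallest table that holds text (index 0) plus the given image indices."""
    return max([0, *indices]) + 1


distinct = image_type_indices(10, distinct=True)   # original behaviour: i + 1
shared = image_type_indices(10, distinct=False)    # my workaround: always 1

print(min_vocab_size(distinct))  # 11
print(min_vocab_size(shared))    # 2
```

If this reading is right, keeping i + 1 would need `modality_type_vocab_size` to be at least 11 instead of 2. Using 1 for every image fits the existing 2 rows, but then all 10 images share the same type embedding, so the model cannot tell the image slots apart from that embedding alone. I have not verified either option against this checkpoint.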