Getting IndexError: list index out of range when fine-tuning

Hi everyone! I want to fine-tune my pre-trained Longformer model and am getting this error:

---------------------------------------------------------------------------

IndexError                                Traceback (most recent call last)

<ipython-input-54-2f2d9c2c00fc> in <module>()
     45     )
     46 
---> 47 train_results = trainer.train()

6 frames

/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
   1032             self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
   1033 
-> 1034             for step, inputs in enumerate(epoch_iterator):
   1035 
   1036                 # Skip past any already trained steps if resuming training

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
    515             if self._sampler_iter is None:
    516                 self._reset()
--> 517             data = self._next_data()
    518             self._num_yielded += 1
    519             if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
    555     def _next_data(self):
    556         index = self._next_index()  # may raise StopIteration
--> 557         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    558         if self._pin_memory:
    559             data = _utils.pin_memory.pin_memory(data)

/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

<ipython-input-53-5e4959dcf50c> in __getitem__(self, idx)
      7 
      8     def __getitem__(self, idx):
----> 9         item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
     10         item['labels'] = torch.tensor(self.labels[idx])
     11         return item

<ipython-input-53-5e4959dcf50c> in <dictcomp>(.0)
      7 
      8     def __getitem__(self, idx):
----> 9         item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
     10         item['labels'] = torch.tensor(self.labels[idx])
     11         return item

IndexError: list index out of range

Evidently it's a problem with my tokenization, but I can't figure out what. When training the LM, I made sure the length argument was set for the tokenizer:

tokenizer = LongformerTokenizerFast.from_pretrained("./ny_model", max_len=3500)

with a hefty 52,000 vocab size. Next, when fine-tuning:

train_encodings = tokenizer(list(train_text), truncation=True, padding=True, max_length=3500)
val_encodings = tokenizer(list(val_text), truncation=True, padding=True, max_length=3500)

You can see I truncate the sequences. I also tried with some dummy data (making sure the sequences were of equal length) and got the same problem.
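For context, here is (roughly) the dataset class the traceback comes from; the __getitem__ matches the error above, and __len__ is the usual length-of-labels pattern, so if it reports more examples than actually exist in self.encodings or self.labels, the DataLoader requests an index that isn't there and raises exactly this IndexError:

import torch

class MyDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings  # dict of lists returned by the tokenizer
        self.labels = labels        # one label per example

    def __getitem__(self, idx):
        # raises IndexError if any list here is shorter than len(self)
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)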

So what could the problem be? Any ideas?

Note that I am fine-tuning the model after uploading the LM to the Hugging Face Hub. I have also attached the code used to train the LM (Google Colab notebook).

Have you figured out the reason? I'm getting the same error here.


I don't remember how I fixed it, but it was most probably something on your data side.

It seems from the snippet that you have fewer labels than inputs, or that your labels are empty.
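A quick way to check would be something like this (train_labels is a placeholder for whatever list you pass as labels):

# sanity check before building the dataset
print(len(train_encodings["input_ids"]), len(train_labels))
assert len(train_encodings["input_ids"]) == len(train_labels), "labels and encodings are out of sync"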

Can anyone help me out, please? I'm getting the same error.


Has this been figured out?


Use padding='max_length' or padding='longest' instead of padding=True.
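For example (a sketch, adapting the tokenizer call from the original post):

train_encodings = tokenizer(
    list(train_text),
    truncation=True,
    padding="max_length",  # or padding="longest"
    max_length=3500,
)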

I had a similar issue with the exact same error, where generate() didn't work. I had modified the input_ids and therefore also the attention_mask, which led to them both having the wrong dimensions! I used …[input_ids].unsqueeze(0) on both, which then worked.
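Roughly like this (the names here are illustrative, not my exact code); generate() expects a batch dimension, so a single encoded example needs the extra axis:

# encoding holds a single example; add a batch dimension before generate()
input_ids = encoding["input_ids"].unsqueeze(0)            # shape: (1, seq_len)
attention_mask = encoding["attention_mask"].unsqueeze(0)  # shape: (1, seq_len)
outputs = model.generate(input_ids, attention_mask=attention_mask)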
