Hi everyone! I want to fine-tune my pre-trained Longformer model and am getting this error:
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-54-2f2d9c2c00fc> in <module>()
45 )
46
---> 47 train_results = trainer.train()
6 frames
/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
1032 self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
1033
-> 1034 for step, inputs in enumerate(epoch_iterator):
1035
1036 # Skip past any already trained steps if resuming training
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
515 if self._sampler_iter is None:
516 self._reset()
--> 517 data = self._next_data()
518 self._num_yielded += 1
519 if self._dataset_kind == _DatasetKind.Iterable and \
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
555 def _next_data(self):
556 index = self._next_index() # may raise StopIteration
--> 557 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
558 if self._pin_memory:
559 data = _utils.pin_memory.pin_memory(data)
/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
42 def fetch(self, possibly_batched_index):
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:
46 data = self.dataset[possibly_batched_index]
/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
42 def fetch(self, possibly_batched_index):
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:
46 data = self.dataset[possibly_batched_index]
<ipython-input-53-5e4959dcf50c> in __getitem__(self, idx)
7
8 def __getitem__(self, idx):
----> 9 item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
10 item['labels'] = torch.tensor(self.labels[idx])
11 return item
<ipython-input-53-5e4959dcf50c> in <dictcomp>(.0)
7
8 def __getitem__(self, idx):
----> 9 item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
10 item['labels'] = torch.tensor(self.labels[idx])
11 return item
IndexError: list index out of range
Evidently, it’s a problem with my tokenization, but I can’t pin it down. For training the LM, I made sure the length
argument is set for the tokenizer:
tokenizer = LongformerTokenizerFast.from_pretrained("./ny_model", max_len=3500)
with a hefty 52,000-token vocabulary. Next, when fine-tuning:
train_encodings = tokenizer(list(train_text), truncation=True, padding=True, max_length=3500)
val_encodings = tokenizer(list(val_text), truncation=True, padding=True, max_length=3500)
As you can see, I truncate the sequences. I also tried with some dummy data (ensuring all sequences are of equal length) and hit the same problem.
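Since the IndexError is raised inside `__getitem__` when indexing the encoding lists (or `self.labels`), one thing I can check is whether every list the dataset indexes actually has the same length. This is just a diagnostic sketch with dummy data; `check_lengths` and the dummy variable names are mine, not from my actual code:

```python
# Sanity check: the IndexError in __getitem__ suggests a length mismatch
# between the tokenizer output lists and the labels list, so compare them.
def check_lengths(encodings, labels):
    lengths = {key: len(val) for key, val in encodings.items()}
    lengths["labels"] = len(labels)
    return lengths

# Dummy data: every reported length should be identical.
dummy_encodings = {"input_ids": [[1, 2], [3, 4]],
                   "attention_mask": [[1, 1], [1, 1]]}
dummy_labels = [0]  # one label too few -> __getitem__ would fail at idx 1
print(check_lengths(dummy_encodings, dummy_labels))
# -> {'input_ids': 2, 'attention_mask': 2, 'labels': 1}
```

Running this on my real `train_encodings`/labels would tell me whether the dataset’s length disagrees with the labels list, but I haven’t spotted an obvious mismatch yet.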
So what could the problem be? Any ideas?
Note that I am fine-tuning the model after uploading the LM on Huggingface.
Also, I have attached the code required to train the LM:
Google Colab