Not able to predict using Transformers Trainer class

LijinDurairaj · October 1, 2024, 9:39pm

I am learning to fine-tune a pre-trained model, but when I try to test the fine-tuned model I get this error:

IndexError: tuple index out of range

This is my Google Colab notebook, and I am using this Kaggle dataset.

While debugging, I made a few observations:

When I pass the test data with ‘input_ids’, ‘attention_mask’ and ‘labels’, I get the predicted results. When I pass it without labels, i.e. with only ‘input_ids’ and ‘attention_mask’, I get the error.
However, when I pass only ‘input_ids’ and ‘attention_mask’ from the training data, I also get predictions, which I do not understand.
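
One way an `IndexError: tuple index out of range` can depend on whether labels are present is sketched below. This is a possible mechanism only, not a confirmed diagnosis of this notebook: when a Transformers model returns a plain tuple, the loss comes first only if labels were supplied, so code that always reads position 1 breaks on unlabeled input. `fake_forward` and `get_logits` are hypothetical stand-ins written in pure Python, not library functions.

```python
# Hypothetical stand-in for a model forward pass that returns a plain tuple.
# Assumption: with labels the tuple is (loss, logits); without, (logits,).
def fake_forward(input_ids, attention_mask, labels=None):
    logits = [[0.1, 0.9] for _ in input_ids]  # dummy two-class scores
    if labels is not None:
        loss = 0.5  # dummy loss value
        return (loss, logits)
    return (logits,)


def get_logits(outputs, has_labels):
    # Pick the logits position based on whether a loss was prepended.
    return outputs[1] if has_labels else outputs[0]


batch = {"input_ids": [[101, 2023, 102]], "attention_mask": [[1, 1, 1]]}

# Labeled input: the logits sit at index 1.
labeled = fake_forward(**batch, labels=[1])
labeled_logits = get_logits(labeled, has_labels=True)

# Unlabeled input: blindly reading index 1 fails, index 0 works.
unlabeled = fake_forward(**batch)
try:
    unlabeled[1]
    error_message = None
except IndexError as err:
    error_message = str(err)

unlabeled_logits = get_logits(unlabeled, has_labels=False)
```

If something like this is involved, a custom `compute_metrics` function or any post-processing that indexes into the prediction output would be a reasonable place to look first.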

I am stuck and cannot move any further. Any help solving this issue would be appreciated.

John6666 · October 1, 2024, 10:28pm

The error itself is common and usually comes from a simple mistake in the Python code, but after searching, it appears that it may be a long-standing unresolved bug or poorly specified behavior.

github.com/huggingface/transformers

[Possible Bug] Getting IndexError: list index out of range when fine-tuning custom LM model

opened 12:22PM - 06 Apr 21 UTC

closed 03:03PM - 17 May 21 UTC

neel04

## Environment info

- `transformers` version: 4.3.3
- Platform: Linux-4.…19.112+-x86_64-with-Ubuntu-18.04-bionic
- Python version: 3.7.10
- PyTorch version (GPU?): 1.7.1+cu101 (False)
- Tensorflow version (GPU?): 2.4.1 (False)
- Using GPU in script?: True/False
- Using distributed or parallel set-up in script?: False

### Who can help

- longformer, reformer, transfoxl, xlnet: @patrickvonplaten
- tokenizers: @LysandreJik
- trainer: @sgugger

## Information

Model I am using (Bert, XLNet ...): `LongFormer`

The problem arises when using:

* [x] the official example scripts: (give details below)
* [ ] my own modified scripts: (give details below)

The tasks I am working on is:

* [ ] an official GLUE/SQUaD task: (give the name)
* [x] my own task or dataset: (give details below)

## To reproduce

Hi, I am trying to train an LM model on a custom dataset (which is simply text over multiple lines). My choice was the Longformer, and I am using the exact same code provided officially with just a few modifications.

When I fine-tune it on a custom dataset, I am getting this error:-

```py
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-54-2f2d9c2c00fc> in <module>()
     45 )
     46
---> 47 train_results = trainer.train()

6 frames
/usr/local/lib/python3.7/dist-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
   1032             self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
   1033
-> 1034             for step, inputs in enumerate(epoch_iterator):
   1035
   1036                 # Skip past any already trained steps if resuming training

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
    515             if self._sampler_iter is None:
    516                 self._reset()
--> 517             data = self._next_data()
    518             self._num_yielded += 1
    519             if self._dataset_kind == _DatasetKind.Iterable and \

/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
    555     def _next_data(self):
    556         index = self._next_index()  # may raise StopIteration
--> 557         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    558         if self._pin_memory:
    559             data = _utils.pin_memory.pin_memory(data)

/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

<ipython-input-53-5e4959dcf50c> in __getitem__(self, idx)
      7
      8     def __getitem__(self, idx):
----> 9         item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
     10         item['labels'] = torch.tensor(self.labels[idx])
     11         return item

<ipython-input-53-5e4959dcf50c> in <dictcomp>(.0)
      7
      8     def __getitem__(self, idx):
----> 9         item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
     10         item['labels'] = torch.tensor(self.labels[idx])
     11         return item

IndexError: list index out of range
```

Most probably it is a tokenization problem, but I can't seem to locate it. I ensured that the tokenizer in the **LM** does accept an appropriate length (even if it is quite bigger than I want):

`tokenizer = LongformerTokenizerFast.from_pretrained("./ny_model", max_len=3500)`

For fine-tuning, I ensured that it would truncate&pad, though none of my data samples are long enough to truncate:

```py
train_encodings = tokenizer(list(train_text), truncation=True, padding=True, max_length=3500)
val_encodings = .....
```

Finally, I tried with some dummy data with _fixed length_ like this:

```py
train_text = ['a', 'b']
val_text = ['c', 'd']
```

Which rules out most tokenization errors. I am fine-tuning in accordance to official scripts - something I have done before. the LM looks good to me and tokenizes individually as well, so I have no reason to suspect it.

I am attaching my **LM** code:-

```py
!pip install -q git+https://github.com/huggingface/transformers
!pip list | grep -E 'transformers|tokenizers'

%%time
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(files='./NYA.txt', vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

!mkdir ny_model
tokenizer.save_model("ny_model")

from transformers import LongformerConfig

config = LongformerConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=2,
    num_hidden_layers=1,
    type_vocab_size=1,
)

from transformers import LongformerTokenizerFast

tokenizer = LongformerTokenizerFast.from_pretrained("./ny_model", max_len=3500)

from transformers import LongformerForMaskedLM

model = LongformerForMaskedLM(config=config)

%%time
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="./NYA.txt",
    block_size=128,
)

from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=2,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
    learning_rate=1e-5,
    logging_steps=50,
    fp16=True
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator
)

trainer.train()
```

and as said again, the fine-tuning part is again just like the official scripts, save the tokenizer arguments and some simple training args. I believe that this code with a simple dummy dataset could reproduce the bug. I can provide further help on the gist if someone can create one for full reproducibility.

If there is some idiotic mistake I have made, please don't hesitate to point that out.

> Any Ideas what the problem might be?

Cheers
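
The traceback in the linked issue ends inside a dataset `__getitem__` that indexes both the encodings and the labels. As a rough sketch, and not a claim about what caused that issue, a dataset wrapper can check up front that every column has the same length and can treat labels as optional, so the same class serves labeled training data and unlabeled test data. `EncodedDataset` is a hypothetical name. The values are left as plain Python lists so the example runs without PyTorch; in real use each value would typically be wrapped in `torch.tensor(...)` as in the issue's code.

```python
class EncodedDataset:
    """Hypothetical map-style dataset over tokenizer-style encodings."""

    def __init__(self, encodings, labels=None):
        # Record the length of every column so a mismatch fails early,
        # instead of surfacing later as an IndexError inside __getitem__.
        lengths = {key: len(values) for key, values in encodings.items()}
        if labels is not None:
            lengths["labels"] = len(labels)
        if len(set(lengths.values())) > 1:
            raise ValueError(f"column lengths differ: {lengths}")
        self.encodings = encodings
        self.labels = labels
        self._length = next(iter(lengths.values()), 0)

    def __len__(self):
        return self._length

    def __getitem__(self, idx):
        item = {key: values[idx] for key, values in self.encodings.items()}
        # Labels are optional so unlabeled test data can use the same class.
        if self.labels is not None:
            item["labels"] = self.labels[idx]
        return item


# Dummy encodings shaped like tokenizer output (the values are made up).
encodings = {
    "input_ids": [[0, 5, 2], [0, 6, 2]],
    "attention_mask": [[1, 1, 1], [1, 1, 1]],
}
train_ds = EncodedDataset(encodings, labels=[0, 1])  # labeled
test_ds = EncodedDataset(encodings)                  # unlabeled
```

Passing a labels list whose length differs from the encodings raises `ValueError` at construction time, which is usually easier to trace than an index error in the middle of a training or prediction loop.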

samchain · October 2, 2024, 9:07am

Hey,

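If you use the ‘.predict’ method, make sure to use the model within the Trainer. The Trainer itself is a wrapper around a model, so the Trainer's predict may not be the same as the model's own prediction call.

To check this, you could run:

# PyTorch models are called directly; a `.predict` method exists only on Keras (TensorFlow) models
test_tokens = tokenizer("I am writing this post", return_tensors="pt")  # check the format of the tensors
test_tokens = test_tokens.to(trainer.model.device)  # keep inputs on the same device as the model
outputs = trainer.model(**test_tokens)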

It might be something else but sometimes people get confused between the Trainer class and the model itself.
