Error when training with `peft` + `lora`

devengqc · May 15, 2023, 2:19pm

Hello, I am following the tutorial here (Google Colab) and fine-tuning the model on a custom dataset. I load my dataset from a pandas DataFrame, and I'm not sure what the error below means. Can anyone help me with this? TIA!

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <cell line: 21>:21                                                                            │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1664 in train                    │
│                                                                                                  │
│   1661 │   │   inner_training_loop = find_executable_batch_size(                                 │
│   1662 │   │   │   self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size  │
│   1663 │   │   )                                                                                 │
│ ❱ 1664 │   │   return inner_training_loop(                                                       │
│   1665 │   │   │   args=args,                                                                    │
│   1666 │   │   │   resume_from_checkpoint=resume_from_checkpoint,                                │
│   1667 │   │   │   trial=trial,                                                                  │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/transformers/trainer.py:1909 in _inner_training_loop     │
│                                                                                                  │
│   1906 │   │   │   │   rng_to_sync = True                                                        │
│   1907 │   │   │                                                                                 │
│   1908 │   │   │   step = -1                                                                     │
│ ❱ 1909 │   │   │   for step, inputs in enumerate(epoch_iterator):                                │
│   1910 │   │   │   │   total_batched_samples += 1                                                │
│   1911 │   │   │   │   if rng_to_sync:                                                           │
│   1912 │   │   │   │   │   self._load_rng_state(resume_from_checkpoint)                          │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:634 in __next__           │
│                                                                                                  │
│    631 │   │   │   if self._sampler_iter is None:                                                │
│    632 │   │   │   │   # TODO(https://github.com/pytorch/pytorch/issues/76750)                   │
│    633 │   │   │   │   self._reset()  # type: ignore[call-arg]                                   │
│ ❱  634 │   │   │   data = self._next_data()                                                      │
│    635 │   │   │   self._num_yielded += 1                                                        │
│    636 │   │   │   if self._dataset_kind == _DatasetKind.Iterable and \                          │
│    637 │   │   │   │   │   self._IterableDataset_len_called is not None and \                    │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:678 in _next_data         │
│                                                                                                  │
│    675 │                                                                                         │
│    676 │   def _next_data(self):                                                                 │
│    677 │   │   index = self._next_index()  # may raise StopIteration                             │
│ ❱  678 │   │   data = self._dataset_fetcher.fetch(index)  # may raise StopIteration              │
│    679 │   │   if self._pin_memory:                                                              │
│    680 │   │   │   data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)            │
│    681 │   │   return data                                                                       │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py:49 in fetch             │
│                                                                                                  │
│   46 │   def fetch(self, possibly_batched_index):                                                │
│   47 │   │   if self.auto_collation:                                                             │
│   48 │   │   │   if hasattr(self.dataset, "__getitems__") and self.dataset.__getitems__:         │
│ ❱ 49 │   │   │   │   data = self.dataset.__getitems__(possibly_batched_index)                    │
│   50 │   │   │   else:                                                                           │
│   51 │   │   │   │   data = [self.dataset[idx] for idx in possibly_batched_index]                │
│   52 │   │   else:                                                                               │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py:2782 in __getitems__           │
│                                                                                                  │
│   2779 │                                                                                         │
│   2780 │   def __getitems__(self, keys: List) -> List:                                           │
│   2781 │   │   """Can be used to get a batch using a list of integers indices."""                │
│ ❱ 2782 │   │   batch = self.__getitem__(keys)                                                    │
│   2783 │   │   n_examples = len(batch[next(iter(batch))])                                        │
│   2784 │   │   return [{col: array[i] for col, array in batch.items()} for i in range(n_example  │
│   2785                                                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py:2778 in __getitem__            │
│                                                                                                  │
│   2775 │                                                                                         │
│   2776 │   def __getitem__(self, key):  # noqa: F811                                             │
│   2777 │   │   """Can be used to index columns (by string names) or rows (by integer index or i  │
│ ❱ 2778 │   │   return self._getitem(key)                                                         │
│   2779 │                                                                                         │
│   2780 │   def __getitems__(self, keys: List) -> List:                                           │
│   2781 │   │   """Can be used to get a batch using a list of integers indices."""                │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py:2762 in _getitem               │
│                                                                                                  │
│   2759 │   │   format_kwargs = kwargs["format_kwargs"] if "format_kwargs" in kwargs else self._  │
│   2760 │   │   format_kwargs = format_kwargs if format_kwargs is not None else {}                │
│   2761 │   │   formatter = get_formatter(format_type, features=self._info.features, **format_kw  │
│ ❱ 2762 │   │   pa_subtable = query_table(self._data, key, indices=self._indices if self._indice  │
│   2763 │   │   formatted_output = format_table(                                                  │
│   2764 │   │   │   pa_subtable, key, formatter=formatter, format_columns=format_columns, output  │
│   2765 │   │   )                                                                                 │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py:578 in query_table     │
│                                                                                                  │
│   575 │   │   _check_valid_column_key(key, table.column_names)                                   │
│   576 │   else:                                                                                  │
│   577 │   │   size = indices.num_rows if indices is not None else table.num_rows                 │
│ ❱ 578 │   │   _check_valid_index_key(key, size)                                                  │
│   579 │   # Query the main table                                                                 │
│   580 │   if indices is None:                                                                    │
│   581 │   │   pa_subtable = _query_table(table, key)                                             │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py:531 in                 │
│ _check_valid_index_key                                                                           │
│                                                                                                  │
│   528 │   │   │   _check_valid_index_key(min(key), size=size)                                    │
│   529 │   elif isinstance(key, Iterable):                                                        │
│   530 │   │   if len(key) > 0:                                                                   │
│ ❱ 531 │   │   │   _check_valid_index_key(int(max(key)), size=size)                               │
│   532 │   │   │   _check_valid_index_key(int(min(key)), size=size)                               │
│   533 │   else:                                                                                  │
│   534 │   │   _raise_bad_key_type(key)                                                           │
│                                                                                                  │
│ /usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py:521 in                 │
│ _check_valid_index_key                                                                           │
│                                                                                                  │
│   518 def _check_valid_index_key(key: Union[int, slice, range, Iterable], size: int) -> None:    │
│   519 │   if isinstance(key, int):                                                               │
│   520 │   │   if (key < 0 and key + size < 0) or (key >= size):                                  │
│ ❱ 521 │   │   │   raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")       │
│   522 │   │   return                                                                             │
│   523 │   elif isinstance(key, slice):                                                           │
│   524 │   │   pass                                                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
IndexError: Invalid key: 19 is out of bounds for size 0

hajipour · August 25, 2023, 6:55am

I think there is a bug in the `Trainer` class (probably in the `_remove_unused_columns` function). This error happens only when I use LoRA fine-tuning. For now, I resolved the error by setting `remove_unused_columns=False` in the `TrainingArguments`.
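To make the workaround above more concrete, here is a minimal standard-library sketch of how a column filter of this kind can end up dropping everything. It is an illustration under assumptions, not the actual `Trainer` code: `keep_columns`, `plain_forward`, `wrapped_forward`, and the column names are hypothetical. The only thing taken from the thread is the `remove_unused_columns` flag. The sketch assumes the filter keeps only columns whose names match the parameters of the model's `forward` signature (plus the usual label names). If so, a `forward` that exposes only `*args, **kwargs`, or a dataset whose columns are not model inputs, would leave no columns at all. That would be consistent with the "size 0" in the error.

```python
import inspect


def keep_columns(forward, column_names, remove_unused_columns=True):
    """Simplified stand-in for a signature-based column filter.

    NOT the real Trainer implementation; it only mirrors the idea that
    columns are kept when their name matches a parameter of `forward`.
    """
    if not remove_unused_columns:
        # Workaround from the thread: skip the filtering step entirely.
        return list(column_names)
    accepted = set(inspect.signature(forward).parameters)
    accepted |= {"label", "label_ids", "labels"}
    return [name for name in column_names if name in accepted]


def plain_forward(input_ids=None, attention_mask=None, labels=None):
    """Hypothetical forward with explicit argument names."""


def wrapped_forward(*args, **kwargs):
    """Hypothetical wrapper forward that hides the argument names."""


# Hypothetical dataset columns: two model inputs plus the raw text.
columns = ["input_ids", "attention_mask", "text"]

kept_plain = keep_columns(plain_forward, columns)
kept_wrapped = keep_columns(wrapped_forward, columns)
kept_disabled = keep_columns(wrapped_forward, columns, remove_unused_columns=False)

print(kept_plain)     # ['input_ids', 'attention_mask']
print(kept_wrapped)   # []
print(kept_disabled)  # ['input_ids', 'attention_mask', 'text']
```

In real code the equivalent of the last call is passing `remove_unused_columns=False` to `TrainingArguments`, as described above. With filtering disabled, every column reaches the data collator, so non-tensor columns such as raw text may need to be removed from the dataset by hand before training.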
