IndexError: Invalid key: 16 is out of bounds for size 0

I am trying to generate a large dataset for fine-tuning a Wav2Vec2 model.

It is important that the data does not get cached in memory, and I am still not sure whether that will be the case the way I am doing it.
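One way to check this is to look at the Arrow cache files backing each split after loading it; a quick sketch (the builder script path below is a placeholder):

from datasets import load_dataset

ds = load_dataset("path/to/new_dataset.py", split="train")  # placeholder path to the builder script below

print(ds.cache_files)        # non-empty list of Arrow files -> the split is memory-mapped from disk
print(ds.info.dataset_size)  # logical size in bytes, matching dataset_info.json below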

However, I have managed to generate a small dataset myself (TIMIT in this case), but as training starts I get the following exception:

  File "/home/sfalk/miniconda3/envs/speech/lib/python3.9/site-packages/transformers/trainer.py", line 1290, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/home/sfalk/miniconda3/envs/speech/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/home/sfalk/miniconda3/envs/speech/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/sfalk/miniconda3/envs/speech/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/sfalk/miniconda3/envs/speech/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/sfalk/miniconda3/envs/speech/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 1857, in __getitem__
    return self._getitem(
  File "/home/sfalk/miniconda3/envs/speech/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 1849, in _getitem
    pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
  File "/home/sfalk/miniconda3/envs/speech/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 462, in query_table
    _check_valid_index_key(key, size)
  File "/home/sfalk/miniconda3/envs/speech/lib/python3.9/site-packages/datasets/formatting/formatting.py", line 405, in _check_valid_index_key
    raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
IndexError: Invalid key: 16 is out of bounds for size 0
  0%|                                                   | 0/700 [00:00<?, ?it/s]

This is the implementation of my GeneratorBasedBuilder:

class NewDataset(datasets.GeneratorBasedBuilder):

    VERSION: datasets.Version = datasets.Version("0.0.1")

    BUILDER_CONFIGS = [
        datasets.BuilderConfig(
            version=VERSION, description="This part of my dataset covers a first domain"
        ),
    ]

    def _info(self):

        features = datasets.Features(
            {
                "inputs": datasets.features.Sequence(datasets.Value("int16")),
                "targets": datasets.Value("string"),
                "length": datasets.Value("int64"),
            }
        )

        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=features,
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "filepath": "/mariana/asr/corpora/converted/en/timit_train",
                    "split": "train",
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={
                    "filepath": "/mariana/asr/corpora/converted/en/timit_test",
                    "split": "test"
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                gen_kwargs={
                    "filepath": "/mariana/asr/corpora/converted/en/timit_dev",
                    "split": "dev",
                },
            ),
        ]

    def _generate_examples(self, filepath, split):
        corpus = ConvertedCorpus(filepath)
        for i, record in enumerate(corpus.sample_generator()):
            key = "/".join((str(record.speaker_id), str(record.sample_id)))
            yield key, dict(inputs=record.wav, targets=record.transcript, length=len(record.wav))
            if i >= 100:
                break

This is the content of the cache directory:

$ ls -lah
total 30M
drwxrwxr-x 2 sfalk sfalk 4.0K Feb  1 13:28 .
drwxrwxr-x 3 sfalk sfalk 4.0K Feb  1 13:28 ..
-rw-rw-r-- 1 sfalk sfalk 1.2K Feb  1 13:28 dataset_info.json
-rw-rw-r-- 1 sfalk sfalk    0 Feb  1 13:28 LICENSE
-rw-rw-r-- 1 sfalk sfalk  11M Feb  1 13:28 new_dataset-test.arrow
-rw-rw-r-- 1 sfalk sfalk  10M Feb  1 13:28 new_dataset-train.arrow
-rw-rw-r-- 1 sfalk sfalk 9.3M Feb  1 13:28 new_dataset-validation.arrow

And this here is the dataset_info.json:

{
  "description": "This new dataset is designed to solve this great NLP task and is crafted with a lot of care.\n",
  "citation": "@InProceedings{huggingface:dataset,\ntitle = {A great new dataset},\nauthor={huggingface, Inc.\n},\nyear={2020}\n}\n",
  "homepage": "",
  "license": "",
  "features": {
    "inputs": {
      "feature": {
        "dtype": "int16",
        "id": null,
        "_type": "Value"
      },
      "length": -1,
      "id": null,
      "_type": "Sequence"
    },
    "targets": {
      "dtype": "string",
      "id": null,
      "_type": "Value"
    },
    "length": {
      "dtype": "int64",
      "id": null,
      "_type": "Value"
    }
  },
  "post_processed": null,
  "supervised_keys": null,
  "task_templates": null,
  "builder_name": "new_dataset",
  "config_name": "default",
  "version": {
    "version_str": "0.0.1",
    "description": null,
    "major": 0,
    "minor": 0,
    "patch": 1
  },
  "splits": {
    "train": {
      "name": "train",
      "num_bytes": 10383006,
      "num_examples": 101,
      "dataset_name": "new_dataset"
    },
    "test": {
      "name": "test",
      "num_bytes": 10771888,
      "num_examples": 101,
      "dataset_name": "new_dataset"
    },
    "validation": {
      "name": "validation",
      "num_bytes": 9742303,
      "num_examples": 101,
      "dataset_name": "new_dataset"
    }
  },
  "download_checksums": {},
  "download_size": 0,
  "post_processing_size": null,
  "dataset_size": 30897197,
  "size_in_bytes": 30897197
}

One additional interesting observation: there is a replays field on the MemoryMappedTable object (sigh). It looks like all feature columns have been dropped?

Hi! It looks like your code dropped the columns at some point. Which script are you using?

I know that the Trainer class from transformers drops the columns that are not named after actual inputs of the model you want to use; could it be because of that?


Hello, I am facing the same issue. Here is a minimal example that fails.

from transformers import (
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)
from datasets import load_dataset, load_metric

# load processor and model, see: https://huggingface.co/docs/transformers/model_doc/wav2vec2#wav2vec2
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# load dataset, data collator and metric
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean")
data_collator = DataCollatorForTokenClassification(tokenizer=processor.feature_extractor, padding=True)
wer_metric = load_metric("wer")

# training
training_args = TrainingArguments(output_dir="wav2vec2")

trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    train_dataset=dataset["validation"],
    tokenizer=processor.feature_extractor,
)

trainer.train()

And I get the same "Invalid key" IndexError as above.

Any idea what is wrong?


Hi! You are using the transformers Trainer. Please note that the Trainer drops all dataset columns that are not actual inputs to the model for training. If the dataset ends up with no columns, its size becomes zero.

In particular, in your case the Trainer must have logged something like:

***** Running training *****
  Num examples = 0
  Num Epochs = 3
  ...

Can you try setting remove_unused_columns=False in the training arguments?
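For example, a minimal sketch reusing the output directory from the snippet above:

from transformers import TrainingArguments

# Keep all dataset columns so the Dataset is not emptied before training.
# Your data collator then has to handle (or ignore) any extra columns itself.
training_args = TrainingArguments(
    output_dir="wav2vec2",
    remove_unused_columns=False,
)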


Thank you, it solved the invalid key issue.


I am also facing the same issue; see the code below.

from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
import os
from datasets import load_dataset

os.environ["WANDB_DISABLED"] = "true"

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", low_cpu_mem_usage=True)
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", revision="float16", low_cpu_mem_usage=True)
train_dataset = load_dataset(r"D:\Vinoth\Finetune_GPTNEO_GPTJ6B\cp")

training_args = TrainingArguments(
    output_dir="results",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir="logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset["train"],
)

trainer.train()

Same issue here. Very annoying. If the Trainer decides to change or drop anything, it should log what it does and why. Which file and line contains the code that does that removal?

It's _remove_unused_columns() in trainer.py in transformers.
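Roughly, it keeps only the dataset columns whose names match parameters of the model's forward() method (plus a few label columns) and drops everything else. Here is a sketch of that check, not the exact Trainer code, using the feature names from the builder at the top of this thread:

import inspect
from transformers import Wav2Vec2ForCTC

# Any dataset column whose name is not a forward() parameter (or a label key)
# gets removed before training.
signature_columns = [
    p for p in inspect.signature(Wav2Vec2ForCTC.forward).parameters if p != "self"
]
dataset_columns = ["inputs", "targets", "length"]  # columns produced by the builder above

kept = [c for c in dataset_columns if c in signature_columns]
print("signature columns:", signature_columns)  # input_values, attention_mask, labels, ...
print("columns kept:", kept)                    # [] -> the dataset ends up with size 0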

I'm currently facing a similar issue. I created a dataset from a folder in my Drive as follows:

import os
import librosa
import pandas as pd
from datasets import Dataset

# Step 2: Read file paths
def get_file_paths(audio_folder, transcript_folder):
    audio_files = sorted([os.path.join(audio_folder, f) for f in os.listdir(audio_folder) if f.endswith(".wav")])
    transcript_files = sorted([os.path.join(transcript_folder, f) for f in os.listdir(transcript_folder) if f.endswith(".txt")])
    return audio_files, transcript_files

# Step 3: Load and align audio and transcripts
def load_and_align_data(audio_files, transcript_files):
    data = []
    for audio_file, transcript_file in zip(audio_files, transcript_files):
        with open(transcript_file, "r") as f:
            transcript = f.read().strip().upper()
        data.append({"audio": audio_file, "text": transcript})
    return data

# Step 4: Apply alignment function
audio_folder = "/content/drive/MyDrive/TrainingDataset/Audio"
transcript_folder = "/content/drive/MyDrive/TrainingDataset/Transcripts"
audio_files, transcript_files = get_file_paths(audio_folder, transcript_folder)
aligned_data = load_and_align_data(audio_files, transcript_files)

# Step 5: Create custom dataset
custom_dataset = Dataset.from_dict({"audio": [d["audio"] for d in aligned_data], "text": [d["text"] for d in aligned_data]})

Then I loaded the models and defined the training arguments:

from transformers import TrainingArguments, Trainer, Wav2Vec2ForCTC, Wav2Vec2Tokenizer, Wav2Vec2Processor

# load pre-trained model, tokenizer, processor
tokenizer = Wav2Vec2Tokenizer.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

split_ratio = 0.1
num_samples = len(custom_dataset)
train_dataset = custom_dataset.select(range(int(num_samples * (1 - split_ratio))))
val_dataset = custom_dataset.select(range(int(num_samples * (1 - split_ratio)), num_samples))

# Step 4: Define the training arguments
training_args = TrainingArguments(
    remove_unused_columns=False,
    output_dir="/content/drive/MyDrive/ASR_Results",  # output directory for the checkpoints and evaluation results
    evaluation_strategy="steps",        # evaluation strategy to adopt during training
    eval_steps=500,                     # number of steps between evaluations on the validation set
    save_total_limit=2,                 # limit the total amount of checkpoints
    learning_rate=3e-4,                 # learning rate for the optimizer
    per_device_train_batch_size=4,      # batch size for training
    per_device_eval_batch_size=4,       # batch size for evaluation
    num_train_epochs=5,                 # total number of training epochs
    weight_decay=0.01,                  # weight decay for regularization
    push_to_hub=False,
    logging_dir="./logs",               # directory for storing logs
    logging_steps=500,                  # number of steps between logging messages
)

from datasets import load_metric
from transformers import DataCollatorForTokenClassification

print(len(train_dataset))
print(len(val_dataset))

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer, padding=True)

# training
training_args = TrainingArguments(output_dir="wav2vec2")

# Step 5: Instantiate the Trainer class
trainer = Trainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
)

# Step 6: Train the model
trainer.train()

Please help, it would mean a lot. :disappointed_relieved:

I'm getting the same issue with a custom dataset. Can anyone explain what the _check_valid_index_key function does here exactly?

I'm running the example from one of the Hugging Face blog posts.

The dataset is custom and looks like the following when loaded via the Dataset class:

print(dataset[0])
{'text': "### Human: What is Sarina Landcare Catchment Management Association's ABN? ### Assistant: 75953668479."}

The following is the error trace:

  in <cell line: 3>:3
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1645, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1659, in _inner_training_loop
    train_dataloader = self.get_train_dataloader()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 919, in get_train_dataloader
    train_sampler = self._get_train_sampler()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 841, in _get_train_sampler
    return LengthGroupedSampler(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer_pt_utils.py", line 571, in __init__
    not (isinstance(dataset[0], dict) or isinstance(dataset[0], BatchEncoding))
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 2792, in __getitem__
    return self._getitem(key)
  File "/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py", line 2776, in _getitem
    pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
  File "/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py", line 583, in query_table
    _check_valid_index_key(key, size)
  File "/usr/local/lib/python3.10/dist-packages/datasets/formatting/formatting.py", line 526, in _check_valid_index_key
    raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
IndexError: Invalid key: 0 is out of bounds for size 0

@envizzion I get the same error when replacing the timdettmers/openassistant-guanaco dataset used in the fine-tuning code. I created a custom dataset modeled on the OpenAssistant format, saved it as JSONL, then loaded it with datasets and saved it in Arrow format.
When I load from disk:

# downloaded openassistant dataset
dataset1 = load_from_disk('./data/out/openassistant')
# mine
dataset2 = load_from_disk('./data/out/combined')

print(dataset1)
print(dataset2)

print(dataset1[0])
print(dataset2[0])

it looks like this:

Dataset({
    features: ['text'],
    num_rows: 9846
})
Dataset({
    features: ['text'],
    num_rows: 1200
})
{'text': '### Human: Can you write a short introduction about the relevance of the term "monopsony" in economics? Please use examples related to potential monopsonies in the labour market and cite relevant research.### Assistant: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.\n\nRecent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, leading to a situation where they are dependent on the employer for their livelihood. This dependence can result in further suppression of wages and a decline in working conditions.\n\nOverall, the concept of monopsony is essential to understanding the dynamics of labor markets and the impact of market power on workers. Further research is needed to understand the extent and impact of monopsonies on the economy and to develop policies to address this issue.\n\nReferences:\nBivens, J., & Mishel, L. (2013). The Pay of Corporate Executives and Financial Professionals as Evidence of Rents in Top 1 Percent Incomes. Journal of Economic Perspectives, 27(3), 57-78.### Human: Now explain it to a dog'}
{'text': 'Frage: Warum sind geschΓ€tzte und tatsΓ€chliche Kosten bei Bauprojekten ein zentrales Thema? Antwort: Die geschΓ€tzten und tatsΓ€chlichen Kosten sind bei vielen Bauprojekten ein zentrales Thema, da der Bauherr eine hohe Investitionssumme einsetzen muss, welche in vielen FΓ€llen die sonstigen Ausgaben um ein Vielfaches ΓΌbersteigt.'}

With the OpenAssistant dataset training runs fine, but with my own I get the same "IndexError: Invalid key: 0 is out of bounds for size 0" error.

Edit 1: From the code in datasets/formatting/formatting.py, I guess the problem is the size calculation in line 582; I think it shouldn't be 0. The index check then raises the error correctly, since key >= size. See: datasets/src/datasets/formatting/formatting.py at 0d2b8854c265b4dc202e480427890f472b34ea15 · huggingface/datasets · GitHub

Edit 2: In my case the problem is definitely the structure of my own dataset. The "text" column gets removed by the _remove_unused_columns function of the transformers Trainer (trainer.py). For comparison, I printed out the column names inside that function: for the OpenAssistant dataset the data is in "input_ids", while in mine it landed in "text".

dataset column names: ['input_ids']
signature_columns: ['input_ids', 'attention_mask', 'inputs_embeds', 'labels', 'output_attentions', 'output_hidden_states', 'return_dict', 'kwargs', 'label', 'label_ids', 'labels']
ignored_columns: []
columns: ['input_ids']

dataset column names: ['text']
signature_columns: ['input_ids', 'attention_mask', 'inputs_embeds', 'labels', 'output_attentions', 'output_hidden_states', 'return_dict', 'kwargs', 'label_ids', 'label', 'labels']
ignored_columns: ['text']
columns: []
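
In other words, the dataset has to be tokenized into actual model inputs before it reaches the Trainer. A rough sketch of that step (the tokenizer here is only illustrative):

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative, not the model from this thread

ds = Dataset.from_dict({"text": ["### Human: hi ### Assistant: hello"]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# After the map, the dataset has "input_ids"/"attention_mask", which survive
# _remove_unused_columns, instead of only "text", which does not.
ds = ds.map(tokenize, batched=True, remove_columns=["text"])
print(ds.column_names)  # ['input_ids', 'attention_mask']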

You can get this info by setting the transformers logger’s verbosity level to info:

import transformers
transformers.logging.set_verbosity_info()

Can you share the code that creates a dataset with which we can reproduce this error?

Hello, this isn't the original user but we have the same scenario. I also used the guide to fine-tune Falcon (this time 40B instead of 7B) with my own custom dataset. The following code reproduces the exact error:

from datasets import load_dataset

dataset_name = "A-Alsad/SAT_clean"
dataset = load_dataset(dataset_name, split="train")

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, AutoTokenizer

model_name = "tiiuae/falcon-40b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False


tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)

from transformers import TrainingArguments

output_dir = "Falcon-40B-SAT"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 10
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 500
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    push_to_hub=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
)

from trl import SFTTrainer

max_seq_length = 2800


trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)


for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)

trainer.train()
trainer.push_to_hub()

From what I’ve seen, the issue seems to be in the calculation of the size in the _check_valid_index_key function under datasets/formatting/formatting.py.
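
For reference, the size really is 0 at that point once every column has been removed, which is easy to reproduce outside the Trainer (a minimal sketch):

from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "b", "c"]})
print(len(ds))                          # 3

stripped = ds.remove_columns(["text"])  # what the column removal effectively does here
print(len(stripped))                    # 0
stripped[0]                             # IndexError: Invalid key: 0 is out of bounds for size 0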

@A-Alsad For the Falcon fine-tuning code, I tracked the problem back to the _prepare_dataset() function in the trl SFTTrainer (trl/trainer/sft_trainer.py). It changes the dataset column names from "text" to "input_ids" for the OpenAssistant dataset. With my custom dataset it did not do this, so my "text" column got removed (as described above), leaving the dataset empty and producing the error. I have to investigate further, but maybe this helps.

Edit 1: I figured out the problem with my dataset. For me it had to do with the tokenization in the _prepare_non_packed_dataloader() function in the SFTTrainer. If max_seq_length is too high (I guess for the specific dataset), it can produce an empty input batch. It warns about this but continues anyway and removes the original column. That is basically the root of the error described above. Lowering max_seq_length did the trick for me. I don't know enough about the tokenization process, so maybe someone else can explain.

Edit 2: The problem seems to be that the trl SFTTrainer filters out samples that are shorter than max_seq_length after tokenization. See the GitHub issue posted below. So my fix above does not really solve the issue.
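
Either way, a quick sanity check is to compare the tokenized lengths of your samples against max_seq_length before training. A sketch reusing the dataset and tokenizer from the Falcon example above:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b", trust_remote_code=True)
dataset = load_dataset("A-Alsad/SAT_clean", split="train")

lengths = [len(tokenizer(text)["input_ids"]) for text in dataset["text"]]
print("min / max tokenized length:", min(lengths), max(lengths))
# If the prepared dataset drops every sample relative to max_seq_length,
# it comes out empty and indexing it raises the IndexError discussed here.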


Here is the JSON used to create the dataset.

The following is how it was loaded:

dataset = load_dataset('json',data_files='charity-qa-dataset-2.json', split="train")


Do you have any ideas on how to solve this problem?

For everyone hitting this problem because of the trl library filtering out data as described above, here is the issue tracking it:


@moschimonsters Thanks for the info. Until the PR is merged, I applied the change manually and now it works with no issue.

The problem here is with the _prepare_dataset() function in trl/trainer/sft_trainer.py. Instead, what the source code recommends is using trl.trainer.ConstantLengthDataset to create the dataset.

I changed my dataset to a ConstantLengthDataset, removed group_by_length and dataset_text_field, and added packing=True.

from trl import SFTTrainer
from trl.trainer import ConstantLengthDataset

dataset = ConstantLengthDataset(
    tokenizer=tokenizer, 
    dataset=dataset, 
    dataset_text_field="text")

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
  # group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
  # dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=True # added this
)

Now the code should work.
