IndexError: Invalid key: 16 is out of bounds for size 0

I am trying to generate a large dataset for fine-tuning a Wav2Vec2 model.

It’s important that the data is not cached in memory, and I am still not sure whether that is actually the case with my current approach.

However, I have managed to generate a small dataset myself (TIMIT in this case), but as training starts I get the following exception:

  File "/home/sfalk/miniconda3/envs/speech/lib/python3.9/site-packages/transformers/", line 1290, in train
    for step, inputs in enumerate(epoch_iterator):
  File "/home/sfalk/miniconda3/envs/speech/lib/python3.9/site-packages/torch/utils/data/", line 521, in __next__
    data = self._next_data()
  File "/home/sfalk/miniconda3/envs/speech/lib/python3.9/site-packages/torch/utils/data/", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/sfalk/miniconda3/envs/speech/lib/python3.9/site-packages/torch/utils/data/_utils/", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/sfalk/miniconda3/envs/speech/lib/python3.9/site-packages/torch/utils/data/_utils/", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/sfalk/miniconda3/envs/speech/lib/python3.9/site-packages/datasets/", line 1857, in __getitem__
    return self._getitem(
  File "/home/sfalk/miniconda3/envs/speech/lib/python3.9/site-packages/datasets/", line 1849, in _getitem
    pa_subtable = query_table(self._data, key, indices=self._indices if self._indices is not None else None)
  File "/home/sfalk/miniconda3/envs/speech/lib/python3.9/site-packages/datasets/formatting/", line 462, in query_table
    _check_valid_index_key(key, size)
  File "/home/sfalk/miniconda3/envs/speech/lib/python3.9/site-packages/datasets/formatting/", line 405, in _check_valid_index_key
    raise IndexError(f"Invalid key: {key} is out of bounds for size {size}")
IndexError: Invalid key: 16 is out of bounds for size 0
  0%|                                                   | 0/700 [00:00<?, ?it/s]

This is the implementation of my GeneratorBasedBuilder:

class NewDataset(datasets.GeneratorBasedBuilder):

    VERSION: datasets.Version = datasets.Version("0.0.1")

    BUILDER_CONFIGS = [
        datasets.BuilderConfig(
            version=VERSION, description="This part of my dataset covers a first domain"
        )
    ]

    def _info(self):
        features = datasets.Features(
            {
                "inputs": datasets.features.Sequence(datasets.Value("int16")),
                "targets": datasets.Value("string"),
                "length": datasets.Value("int64"),
            }
        )
        return datasets.DatasetInfo(features=features)

    def _split_generators(self, dl_manager):
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={
                    "filepath": "/mariana/asr/corpora/converted/en/timit_train",
                    "split": "train",
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.TEST,
                gen_kwargs={
                    "filepath": "/mariana/asr/corpora/converted/en/timit_test",
                    "split": "test",
                },
            ),
            datasets.SplitGenerator(
                name=datasets.Split.VALIDATION,
                gen_kwargs={
                    "filepath": "/mariana/asr/corpora/converted/en/timit_dev",
                    "split": "dev",
                },
            ),
        ]

    def _generate_examples(self, filepath, split):
        corpus = ConvertedCorpus(filepath)
        for i, record in enumerate(corpus.sample_generator()):
            key = "/".join((str(record.speaker_id), str(record.sample_id)))
            yield key, dict(inputs=record.wav, targets=record.transcript, length=len(record.wav))
            if i >= 100:
                break

This is the content of the cache directory:

$ ls -lah
total 30M
drwxrwxr-x 2 sfalk sfalk 4.0K Feb  1 13:28 .
drwxrwxr-x 3 sfalk sfalk 4.0K Feb  1 13:28 ..
-rw-rw-r-- 1 sfalk sfalk 1.2K Feb  1 13:28 dataset_info.json
-rw-rw-r-- 1 sfalk sfalk    0 Feb  1 13:28 LICENSE
-rw-rw-r-- 1 sfalk sfalk  11M Feb  1 13:28 new_dataset-test.arrow
-rw-rw-r-- 1 sfalk sfalk  10M Feb  1 13:28 new_dataset-train.arrow
-rw-rw-r-- 1 sfalk sfalk 9.3M Feb  1 13:28 new_dataset-validation.arrow

And this is the dataset_info.json:

{
  "description": "This new dataset is designed to solve this great NLP task and is crafted with a lot of care.\n",
  "citation": "@InProceedings{huggingface:dataset,\ntitle = {A great new dataset},\nauthor={huggingface, Inc.\n},\nyear={2020}\n}\n",
  "homepage": "",
  "license": "",
  "features": {
    "inputs": {
      "feature": {
        "dtype": "int16",
        "id": null,
        "_type": "Value"
      },
      "length": -1,
      "id": null,
      "_type": "Sequence"
    },
    "targets": {
      "dtype": "string",
      "id": null,
      "_type": "Value"
    },
    "length": {
      "dtype": "int64",
      "id": null,
      "_type": "Value"
    }
  },
  "post_processed": null,
  "supervised_keys": null,
  "task_templates": null,
  "builder_name": "new_dataset",
  "config_name": "default",
  "version": {
    "version_str": "0.0.1",
    "description": null,
    "major": 0,
    "minor": 0,
    "patch": 1
  },
  "splits": {
    "train": {
      "name": "train",
      "num_bytes": 10383006,
      "num_examples": 101,
      "dataset_name": "new_dataset"
    },
    "test": {
      "name": "test",
      "num_bytes": 10771888,
      "num_examples": 101,
      "dataset_name": "new_dataset"
    },
    "validation": {
      "name": "validation",
      "num_bytes": 9742303,
      "num_examples": 101,
      "dataset_name": "new_dataset"
    }
  },
  "download_checksums": {},
  "download_size": 0,
  "post_processing_size": null,
  "dataset_size": 30897197,
  "size_in_bytes": 30897197
}

One additional interesting observation: there is a replays field on the MemoryMappedTable object (*sigh*). It looks like all feature columns have been dropped?

Hi! It looks like your code dropped the columns at one point. Which script are you using?

I know that the Trainer class from transformers drops the columns that are not named after actual inputs of the model you want to use; could it be because of that?
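Which columns survive can be checked directly: the Trainer keeps only the dataset columns whose names appear in the model's forward() signature (plus a few label-related names). A minimal illustration with a toy torch module standing in for the real model:

```python
import inspect

import torch


class TinyModel(torch.nn.Module):
    # Stand-in for a real model; only the argument names matter here
    def forward(self, input_values=None, labels=None):
        return input_values


# The Trainer inspects forward() roughly like this to decide what to keep
keep = set(inspect.signature(TinyModel.forward).parameters) - {"self"}
print(keep)  # {'input_values', 'labels'} (set order may vary)
```

Any column not in that set (e.g. a custom "length" or "targets" column) is silently removed before batching.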


Hello, I am facing the same issue. Here is a minimal piece of code that fails.

import librosa

import torch
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2Processor, Wav2Vec2FeatureExtractor, Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM
from transformers import DataCollatorForTokenClassification
from transformers import Trainer, TrainingArguments
from datasets import load_dataset, load_metric

import pdb

# load tokenizer and model # see the doc here:
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# load dataset and more
dataset = load_dataset('hf-internal-testing/librispeech_asr_dummy', 'clean')
data_collator = DataCollatorForTokenClassification(tokenizer=processor.feature_extractor, padding=True)
wer_metric = load_metric("wer")

# training
training_args = TrainingArguments(output_dir="wav2vec2")

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset["validation"],  # the dummy corpus only ships a validation split
)
trainer.train()

And here is the error message:

Any idea of what is wrong?

Hi! You are using the transformers Trainer. Please note that the Trainer drops all dataset columns that are not actual inputs to the model during training. If the dataset ends up with no columns, its size becomes zero.

In particular in your case, the Trainer must have logged

***** Running training *****
  Num examples = 0
  Num Epochs = 3

Can you try setting remove_unused_columns=False in the training arguments?


Thank you, it solved the invalid key issue.