IndexError: Invalid key: 16 is out of bounds for size 0

I tried it, but it gives this error:
"ValueError: You should supply an encoding or a list of encodings to this method that includes input_ids, but you provided "
Has this ever happened to you?
If so, could you tell me whether the data needs a specific format in order to use this method?


cc @ybelkada since it's related to trl

I've solved the error, and the cause turned out to be more interesting (and sillier) than expected:
The error is only raised when I use PEFT LoRA to wrap the base model "gpt2". The error messages are as follows:

The following columns in the training set don't have a corresponding argument in `PeftModel.forward` and have been ignored: input_ids, labels. If input_ids, labels are not expected by `PeftModel.forward`,  you can safely ignore this message.
***** Running training *****
  Num examples = 0
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Training with DataParallel so batch size has been adjusted to: 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 9,198
  Number of trainable parameters = 1,179,648

Index error ................
.................

Here Num examples = 0 means there is no training data left to feed the model: the Trainer judged that the columns the model's forward() refused were not the ones we needed, and removed them (all of them).
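To see why: with remove_unused_columns=True (the default), the Trainer prunes every dataset column whose name does not match an argument of the model's forward() before training starts. Here is a minimal sketch of that pruning logic, assuming it mirrors the Trainer's internal column check (the helper name columns_to_drop is mine, for illustration only):

import inspect

def columns_to_drop(model, dataset_columns):
    # The Trainer inspects the wrapped model's forward() signature...
    accepted = set(inspect.signature(model.forward).parameters)
    # ...and drops every column whose name is not an argument of forward().
    return [col for col in dataset_columns if col not in accepted]

Because PeftModel.forward (in the peft version used here) accepted only *args/**kwargs, neither 'input_ids' nor 'labels' matched, every column was dropped, and the dataset ended up with size 0, which is exactly what the IndexError complains about.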

Here's the definition of my trainer and model, with my custom dataset, which contains two features: {'input_ids': tensor, 'labels': tensor}:

from transformers import TrainingArguments, AutoModelForCausalLM, default_data_collator
from peft import LoraConfig, TaskType, get_peft_model

training_args = TrainingArguments(
    "gpt2-lora-dp-trainer",
    per_device_train_batch_size=args['batch-size'],
    per_device_eval_batch_size=args['eval-batch-size'],
    num_train_epochs=args['train-epoch'],
    evaluation_strategy="epoch",
    remove_unused_columns=False,
)
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    base_model_name_or_path=model_name_or_path,
    r=args['lora-r'],
    lora_alpha=args['lora-alpha'],
    lora_dropout=args['lora-dropout'],
)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path)
model = get_peft_model(model, peft_config)

trainer = MyTrainer(
    model=model,
    data_collator=default_data_collator,
    train_dataset=valid_dataset,
    eval_dataset=valid_dataset,
    optimizers=(optimizer, lr_scheduler),
)
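(The dataset itself is not shown in the thread; as a point of reference, here is a hypothetical sketch of how a two-feature dataset in that shape could be built. The sample texts and max_length are made up.)

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(batch):
    out = tokenizer(batch["text"], truncation=True, max_length=512)
    out["labels"] = out["input_ids"].copy()  # causal LM: labels mirror input_ids
    return out

valid_dataset = Dataset.from_dict({"text": ["some text", "more text"]})
valid_dataset = valid_dataset.map(tokenize, batched=True, remove_columns=["text"])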

BUT when I remove the PEFT model wrapper and just use the base GPT-2 model, the error message changes to:

***** Running training *****
  Num examples = 49,043
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Training with DataParallel so batch size has been adjusted to: 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 9,198
  Number of trainable parameters = 124,439,808

OutOfMemoryError                          Traceback (most recent call last)
.........................

This means the Trainer accepts GPT-2's model.forward(**args) parameter protocol but refuses the PeftModel.forward() one.
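A quick diagnostic (my own check, not from the thread) makes the mismatch visible by printing the two signatures the Trainer inspects:

import inspect
from transformers import AutoModelForCausalLM
from peft import get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")
print(inspect.signature(base.forward))  # named args: input_ids, labels, ...

peft_model = get_peft_model(base, peft_config)  # peft_config as defined above
print(inspect.signature(peft_model.forward))
# In the peft version used here, this delegates via *args/**kwargs,
# so input_ids and labels never appear by name.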

The error still remains when I customize the trainer as:

from transformers import Trainer

# custom data-feeding method: pass input_ids and labels to the model explicitly
class MyTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        outputs = model(
            input_ids=inputs["input_ids"],
            labels=inputs["labels"],
        )
        return (outputs.loss, outputs) if return_outputs else outputs.loss

I tried manually feeding the data and labels to the PeftModel, but nothing changed; the IndexError was still raised.

BUT then I noticed that I had forgotten to pass the training args to the Trainer. After fixing this silly mistake, things got back on the rails:

training_args = TrainingArguments(
    "gpt2-lora-dp-trainer",
    per_device_train_batch_size=args['batch-size'],
    per_device_eval_batch_size=args['eval-batch-size'],
    num_train_epochs=args['train-epoch'],
    evaluation_strategy="epoch",
    remove_unused_columns=False,
)

trainer = MyTrainer(
    model=model,
    args=training_args,
    data_collator=default_data_collator,
    train_dataset=valid_dataset,
    eval_dataset=valid_dataset,
    optimizers=(optimizer, lr_scheduler),
)

Though the question is a silly one, it's worth noting that the compatibility between Trainer and PeftModel is still not good.

I hope my story helps with tracing this error.

Just add remove_unused_columns=False to TrainingArguments
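For example (a minimal version, reusing the output dir from the snippets above):

from transformers import TrainingArguments

training_args = TrainingArguments(
    "gpt2-lora-dp-trainer",
    remove_unused_columns=False,  # keep input_ids/labels even if forward() hides them
)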


Holy Gradient! After this whole thread, you've provided the simple solution! It works well for the LoRA problem!