KeyError: 'loss' while training a QA model

I was fine-tuning BertForQuestionAnswering on the SQuAD dataset from the nlp library with the following arguments:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    "test-qa-squad",
    learning_rate=2e-5,
    weight_decay=0.01,
    label_names=["start_positions", "end_positions"],
    num_train_epochs=5,
    load_best_model_at_end=True,
    evaluation_strategy="epoch",
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dl,
    eval_dataset=train_dl,
)

Then calling trainer.train() trains for some batches, but after a specific batch it throws this error (before one epoch completes):

KeyError                                  Traceback (most recent call last)

<ipython-input-19-3435b262f1ae> in <module>()
----> 1 trainer.train()

3 frames

/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py in __getitem__(self, k)
   1444         if isinstance(k, str):
   1445             inner_dict = {k: v for (k, v) in self.items()}
-> 1446             return inner_dict[k]
   1447         else:
   1448             return self.to_tuple()[k]

KeyError: 'loss'

Is this some issue in the dataset? Any help is much appreciated.

You should double-check that your dataset's items are dictionaries with the keys "start_positions" and "end_positions" (that may be why the model is not returning the loss).
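For example, a quick check like this (a minimal sketch, assuming your tokenized training set is a datasets.Dataset named tokenized_squad, which is a hypothetical name) should show those keys on every example:

sample = tokenized_squad[0]
print(sample.keys())
# should list "input_ids", "attention_mask",
# "start_positions" and "end_positions"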

Also, you seem to be passing DataLoaders to the Trainer; it takes datasets.
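That would look something like this (sketched with hypothetical tokenized_train and tokenized_valid Dataset objects in place of your DataLoaders):

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,  # a datasets.Dataset, not a DataLoader
    eval_dataset=tokenized_valid,
)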

Lastly, for easy debugging you can do the following:

# grab the first batch the Trainer would see
for batch in trainer.get_train_dataloader():
    break
# move it to the GPU and run a forward pass
batch = {k: v.cuda() for k, v in batch.items()}
outputs = trainer.model(**batch)

to easily inspect what’s in your batch and your outputs.
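For instance, printing the keys on both sides tells you whether the labels made it into the batch and whether the model returned a loss:

print(batch.keys())    # should include "start_positions" and "end_positions"
print(outputs.keys())  # "loss" will be missing if the labels were not passed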


@sgugger I am running into a similar problem (KeyError: 'loss'). My dataset does have the items as dictionaries (see image), and my code is as follows:

from transformers import Trainer, TrainingArguments

batch_size = 64
logging_steps = len(dataset["train"]) // batch_size
model_name = f"{model_ckpt}-finetuned-test"
training_args = TrainingArguments(
    output_dir=model_name,
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    disable_tqdm=False,
    logging_steps=logging_steps,
    label_names=["CategoryCode"],
    # push_to_hub=True,
    log_level="error",
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=dataset["train"],
    eval_dataset=dataset["vald"],
    tokenizer=tokenizer,
)
trainer.train()

Note: I am running the above-mentioned code locally on a Mac M1.