KeyError: 'input_ids'. when training BERT with Trainer

mickeymnemonic · November 19, 2020, 12:50pm

greetings fam

just curious if anyone can provide insight on the key error message (KeyError: 'input_ids') i get when i go to train my pretrained BertForMaskedLM model (using trainer_BERT.train()) via the huggingface Trainer on my Dataset object. not sure if it has to do with how i created the dataset or how i am calling my model for training, so any insights are appreciated!!

a detailed view of my code and the key error is available at the link below.

thank you
mick

KeyError Traceback (most recent call last)
in
----> 1 trainer_BERT.train()
2 trainer.save_model("./models/royalBERT")

~/anaconda3/lib/python3.7/site-packages/transformers/trainer.py in train(self, model_path, trial)
755 self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
756
--> 757 for step, inputs in enumerate(epoch_iterator):
758
759 # Skip past any already trained steps if resuming training

~/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
361
362 def __next__(self):
--> 363 data = self._next_data()
364 self._num_yielded += 1
365 if self._dataset_kind == _DatasetKind.Iterable and \

~/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
401 def _next_data(self):
402 index = self._next_index() # may raise StopIteration
--> 403 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
404 if self._pin_memory:
405 data = _utils.pin_memory.pin_memory(data)

~/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
45 else:
46 data = self.dataset[possibly_batched_index]
---> 47 return self.collate_fn(data)

~/anaconda3/lib/python3.7/site-packages/transformers/data/data_collator.py in __call__(self, examples)
193 ) -> Dict[str, torch.Tensor]:
194 if isinstance(examples[0], (dict, BatchEncoding)):
--> 195 examples = [e["input_ids"] for e in examples]
196 batch = self._tensorize_batch(examples)
197 if self.mlm:

~/anaconda3/lib/python3.7/site-packages/transformers/data/data_collator.py in <listcomp>(.0)
193 ) -> Dict[str, torch.Tensor]:
194 if isinstance(examples[0], (dict, BatchEncoding)):
--> 195 examples = [e["input_ids"] for e in examples]
196 batch = self._tensorize_batch(examples)
197 if self.mlm:

KeyError: 'input_ids'

sgugger · November 19, 2020, 3:42pm

The output of the line

tokenizerBERT(unlabelled_dataset['paragraphs'], padding=True, truncation=True)

is not stored anywhere, so you’re passing to the Trainer a dataset that hasn’t been tokenized.
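
To see why that produces this exact error: the collator line in the traceback indexes every example with "input_ids". Here is a minimal plain-Python sketch of that mechanism. The helper name and the sample rows are made up for illustration, and this is not the library's actual code:

```python
def collate_input_ids(examples):
    """Mimic the collator line from the traceback: pull "input_ids" out of each example."""
    return [example["input_ids"] for example in examples]


# Rows from an un-tokenized dataset only carry the original columns (made-up text).
raw_examples = [{"paragraphs": "some raw text"}, {"paragraphs": "more raw text"}]

# Rows from a tokenized dataset carry "input_ids" (the ids below are placeholders).
tokenized_examples = [{"input_ids": [101, 2070, 102]}, {"input_ids": [101, 2062, 102]}]

# Works: every row has the key the collator asks for.
print(collate_input_ids(tokenized_examples))

# Fails the same way as in the traceback: the raw rows have no "input_ids" key.
try:
    collate_input_ids(raw_examples)
except KeyError as err:
    print("KeyError:", err)  # prints: KeyError: 'input_ids'
```

So the error is less about the model call and more about what each row of train_dataset contains when it reaches the collator.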

mickeymnemonic · November 19, 2020, 3:56pm

ok, thank you. so simply store as a variable such as below?

tokenizerBERT = tokenizerBERT(unlabelled_dataset['paragraphs'], padding=True, truncation=True)

the reason i didn’t store it was because when i did it i could no longer save the output w/ the save_pretrained method such as shown below.

tokenizerBERT.save_pretrained('tokenizers/pytorch/labelled/BERT/')

any thoughts?

sgugger · November 19, 2020, 3:59pm

If you store the tokenized output in the tokenizer variable, it won’t work either. You need to store it in the variable you will pass to Trainer as your train_dataset. Also, I don’t know what Dataset class you are using, but the tokenizer probably can’t take it directly.
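
A rough sketch of one way to do that, assuming the data lives in a pandas DataFrame with a text column named "paragraphs" and that the `datasets` library is installed. The helper names `make_tokenize_fn` and `build_train_dataset` are hypothetical, and padding is left to the data collator here, which is an assumption about your setup:

```python
def make_tokenize_fn(tokenizer, text_column="paragraphs"):
    """Return a function that tokenizes one batch of rows (a dict of column -> list)."""

    def tokenize(batch):
        # The tokenizer receives a plain list of strings, not the Dataset object itself.
        return tokenizer(batch[text_column], truncation=True)

    return tokenize


def build_train_dataset(dataframe, tokenizer, text_column="paragraphs"):
    """Build a tokenized Dataset whose rows contain "input_ids" (hypothetical helper)."""
    # Imported here so the helper above can be used without `datasets` installed.
    from datasets import Dataset

    raw = Dataset.from_pandas(dataframe)
    # map() returns a NEW dataset; keep the result, and drop the original columns
    # so that only the tokenizer outputs remain in each row.
    return raw.map(
        make_tokenize_fn(tokenizer, text_column),
        batched=True,
        remove_columns=raw.column_names,
    )
```

With this, something like `train_dataset = build_train_dataset(unlabelled_corpus, tokenizerBERT)` would give you the object to pass as `train_dataset`, while `tokenizerBERT` stays a tokenizer, so `tokenizerBERT.save_pretrained(...)` keeps working.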

mickeymnemonic · November 19, 2020, 7:01pm

i’m definitely missing something…curious to hear your views on what that is.

based on the HF documentation i thought it would be possible to simply pass an in memory dataframe:

unlabelled_dataset = Dataset.from_pandas(unlabelled_corpus) #create Dataset object for tokenization

then take the relevant paragraphs column from the dataframe that i want to finetune the pretrained model on (each row of this column holds a few sentences or a paragraph) and pass it to the pretrained tokenizer after initialization:

tokenizerBERT = BertTokenizerFast.from_pretrained('bert-base-uncased') #BERT model tokenization & check
encodingBERT = tokenizerBERT(unlabelled_dataset['paragraphs'], padding=True, truncation=True)

initializing the data collator w/ reference to the pretrained tokenizer:

data_collator_BERT = DataCollatorForLanguageModeling(tokenizer=tokenizerBERT, mlm=True, mlm_probability=0.15)

and then initializing the model, training arguments and trainer before calling it would do the trick:

model_BERT = BertForMaskedLM.from_pretrained('bert-base-uncased')

training_args_BERT = TrainingArguments(
    output_dir='./BERT',
    num_train_epochs=10,
    evaluation_strategy='steps',
    warmup_steps=10000,
    weight_decay=0.01,
    per_device_train_batch_size=64,
)

trainer_BERT = Trainer(
    model=model_BERT,
    args=training_args_BERT,
    data_collator=data_collator_BERT,
    train_dataset=encodingBERT,
)

trainer_BERT.train()
trainer.save_model('./models/royalBERT')

the features of the Dataset loaded with Dataset.from_pandas are provided below. i am only looking to train the model on the paragraphs column, hence why i pass that column to the pretrained tokenizer.

{'index': Value(dtype='string', id=None), 'filename': Value(dtype='string', id=None), 'page-id': Value(dtype='string', id=None), 'paragraph-id': Value(dtype='string', id=None), 'paragraphs': Value(dtype='string', id=None), '__index_level_0__': Value(dtype='int64', id=None)}

lovelmark · July 21, 2021, 12:42pm

A KeyError means the key you gave pandas isn’t valid. Before doing anything with the data frame, use print(df.columns) to check whether the column exists.

print(df.columns)

I was getting a similar error in some of my own code. It turned out that a particular index was missing from my data frame because I had dropped 2 empty rows. If this is the case, you can do df.reset_index(inplace=True) and the error should be resolved.
