KeyError: 'input_ids' when training BERT with Trainer

greetings fam

just curious if anyone can provide insight on the key error message (KeyError: 'input_ids') i get when i go to train my pretrained BertForMaskedLM model (using code: trainer_BERT.train()) via the huggingface Trainer on my Dataset object. not sure if it has to do with how i created the dataset or how i am calling my model for training, so any insights are appreciated!!

a detailed view of my code and the key error is available at the link below.

thank you
mick

KeyError Traceback (most recent call last)
in
----> 1 trainer_BERT.train()
2 trainer.save_model("./models/royalBERT")

~/anaconda3/lib/python3.7/site-packages/transformers/trainer.py in train(self, model_path, trial)
755 self.control = self.callback_handler.on_epoch_begin(self.args, self.state, self.control)
756
--> 757 for step, inputs in enumerate(epoch_iterator):
758
759 # Skip past any already trained steps if resuming training

~/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
361
362 def __next__(self):
--> 363 data = self._next_data()
364 self._num_yielded += 1
365 if self._dataset_kind == _DatasetKind.Iterable and \

~/anaconda3/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
401 def _next_data(self):
402 index = self._next_index() # may raise StopIteration
--> 403 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
404 if self._pin_memory:
405 data = _utils.pin_memory.pin_memory(data)

~/anaconda3/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
45 else:
46 data = self.dataset[possibly_batched_index]
---> 47 return self.collate_fn(data)

~/anaconda3/lib/python3.7/site-packages/transformers/data/data_collator.py in __call__(self, examples)
193 ) -> Dict[str, torch.Tensor]:
194 if isinstance(examples[0], (dict, BatchEncoding)):
--> 195 examples = [e["input_ids"] for e in examples]
196 batch = self._tensorize_batch(examples)
197 if self.mlm:

~/anaconda3/lib/python3.7/site-packages/transformers/data/data_collator.py in <listcomp>(.0)
193 ) -> Dict[str, torch.Tensor]:
194 if isinstance(examples[0], (dict, BatchEncoding)):
--> 195 examples = [e["input_ids"] for e in examples]
196 batch = self._tensorize_batch(examples)
197 if self.mlm:

KeyError: 'input_ids'

The output of the line

tokenizerBERT(unlabelled_dataset['paragraphs'], padding=True, truncation=True)

is not stored anywhere, so you’re passing the Trainer a dataset that hasn’t been tokenized.
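To see why an untokenized dataset produces this exact error, here is a minimal pure-Python sketch of the lookup shown in the traceback (data_collator.py, line 195). The helper name collect_input_ids and the sample rows are made up for illustration; only the e["input_ids"] lookup is taken from the traceback.

```python
def collect_input_ids(examples):
    """Mimic the per-example lookup the collator performs (see the traceback)."""
    return [e["input_ids"] for e in examples]


# Hypothetical rows from a dataset that was never tokenized: raw text columns only.
raw_rows = [{"paragraphs": "first paragraph"}, {"paragraphs": "second paragraph"}]

# Hypothetical rows after tokenization: each row now carries an "input_ids" list.
# The ids are placeholders, not real BERT vocabulary ids.
tokenized_rows = [{"input_ids": [1, 2, 3]}, {"input_ids": [4, 5]}]

try:
    collect_input_ids(raw_rows)
    raw_error = None
except KeyError as err:
    raw_error = err.args[0]  # name of the missing key

collected = collect_input_ids(tokenized_rows)
```

The raw rows fail because they have no input_ids key, while the tokenized rows pass through the same lookup without error.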

ok, thank you. so simply store as a variable such as below?

tokenizerBERT = tokenizerBERT(unlabelled_dataset['paragraphs'], padding=True, truncation=True)

the reason i didn’t store it was because when i did it i could no longer save the output w/ the save_pretrained method such as shown below.

tokenizerBERT.save_pretrained('tokenizers/pytorch/labelled/BERT/')

any thoughts?

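If you store the tokenized output in the tokenizer variable, it won’t work either: the name tokenizerBERT then points to the encodings instead of the tokenizer, which is why save_pretrained is no longer available. You need to store the output in the variable you will send to Trainer as your train_dataset. Also, I don’t know what Dataset class you are using, but the tokenizer probably can’t take it directly.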
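One possible way to do that, sketched under assumptions: wrap the tokenizer output in a small dataset object whose items are dicts containing input_ids, and keep the tokenizer in its own variable. The class name EncodingsDataset is hypothetical, and the hand-written encodings dict below only stands in for what the tokenizer returns when called on a list of strings; with PyTorch you would typically subclass torch.utils.data.Dataset instead of using a plain class.

```python
class EncodingsDataset:
    """Map-style dataset: item i is a dict with one entry per tokenizer output field."""

    def __init__(self, encodings):
        # encodings: mapping from field name to a list with one entry per example
        self.encodings = encodings

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {key: values[idx] for key, values in self.encodings.items()}


# Placeholder standing in for the tokenizer output (ids are not real vocabulary ids).
encodings = {
    "input_ids": [[101, 7, 8, 102], [101, 9, 102, 0]],
    "attention_mask": [[1, 1, 1, 1], [1, 1, 1, 0]],
}

train_dataset = EncodingsDataset(encodings)
first_item = train_dataset[0]
```

Each item now has the input_ids key that the collator looks up, and because the tokenizer variable is left untouched, tokenizerBERT.save_pretrained(...) keeps working.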

i’m definitely missing something…curious to hear your views on what that is.

based on the HF documentation i thought it would be possible to simply pass an in memory dataframe:

unlabelled_dataset = Dataset.from_pandas(unlabelled_corpus) #create Dataset object for tokenization

take the relevant paragraphs column from the dataframe that i want to finetune the pretrained model on (each row of this column holds a few sentences or a paragraph), and then pass it to the pretrained tokenizer after initialization:

tokenizerBERT = BertTokenizerFast.from_pretrained('bert-base-uncased') #BERT model tokenization & check
encodingBERT = tokenizerBERT(unlabelled_dataset['paragraphs'], padding=True, truncation=True)

initializing the data collator w/ reference to the pretrained tokenizer:

data_collator_BERT = DataCollatorForLanguageModeling(tokenizer=tokenizerBERT, mlm=True, mlm_probability=0.15)

and then initializing the model, training arguments and trainer before calling it would do the trick:

model_BERT = BertForMaskedLM.from_pretrained('bert-base-uncased')

training_args_BERT = TrainingArguments(
    output_dir='./BERT',
    num_train_epochs=10,
    evaluation_strategy='steps',
    warmup_steps=10000,
    weight_decay=0.01,
    per_device_train_batch_size=64,
)

trainer_BERT = Trainer(
    model=model_BERT,
    args=training_args_BERT,
    data_collator=data_collator_BERT,
    train_dataset=encodingBERT,
)

trainer_BERT.train()
trainer.save_model('./models/royalBERT')

the features of the dataset loaded with Dataset.from_pandas are provided below. i’m only looking to train the model on the paragraphs column, which is why i pass that column to the pretrained tokenizer.

{'index': Value(dtype='string', id=None), 'filename': Value(dtype='string', id=None), 'page-id': Value(dtype='string', id=None), 'paragraph-id': Value(dtype='string', id=None), 'paragraphs': Value(dtype='string', id=None), '__index_level_0__': Value(dtype='int64', id=None)}

A KeyError means the key you looked up doesn’t exist; with pandas, that usually means a column or index label is missing. Before doing anything with the data frame, use print(df.columns) to check whether the column exists.

print(df.columns)

I was getting a similar kind of error in one of my scripts. It turned out that a particular index was missing from my data frame because I had dropped two empty rows. If this is the case, you can do df.reset_index(inplace=True) and the error should be resolved.