I have a dataset called tokenized_datasets:
>>> tokenized_datasets
Dataset({
    features: ['attention_mask', 'input_ids', 'labels', 'token_type_ids'],
    num_rows: 755988
})
>>> tokenized_datasets[0].keys()
dict_keys(['attention_mask', 'input_ids', 'labels', 'token_type_ids'])
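(For context, tokenized_datasets comes from a standard datasets.map tokenization step, roughly like the sketch below; the checkpoint name, data file, and original label column name are placeholders, not my exact values.)

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint

def tokenize(batch):
    # produces input_ids, attention_mask and token_type_ids
    return tokenizer(batch["text"], truncation=True, padding="max_length")

raw_dataset = load_dataset("csv", data_files="train.csv")["train"]  # placeholder data source
tokenized_datasets = raw_dataset.map(tokenize, batched=True)
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")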
But when I create a Trainer object, the labels key just disappears!
>>> training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total # of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)
>>> trainer = Trainer(
    model=model,                        # the instantiated 🤗 Transformers model to be trained
    args=training_args,                 # training arguments, defined above
    train_dataset=tokenized_datasets    # training dataset
)
>>> tokenized_datasets[0].keys() # WHAT???
dict_keys(['attention_mask', 'input_ids', 'token_type_ids'])
This causes training to fail like so:
>>> trainer.train()
KeyError Traceback (most recent call last)
<ipython-input-108-21d21c7948cc> in <module>()
----> 1 trainer.train()
[SNIPPED]
/home/sbendl/.local/lib/python3.6/site-packages/transformers/file_utils.py in __getitem__(self, k)
1414 if isinstance(k, str):
1415 inner_dict = {k: v for (k, v) in self.items()}
-> 1416 return inner_dict[k]
1417 else:
1418 return self.to_tuple()[k]
KeyError: 'loss'
I'm at a loss here, and quite frustrated: why on earth is this happening? It doesn't happen when I follow the (very similar) code here. All the code snippets are run sequentially in my notebook; there's no "hidden" code. I have a dataset, I pass it to the Trainer, and as a result my dataset is broken.