Creating a Trainer object is deleting my 'labels' feature

I have a dataset called tokenized_datasets:

>>> tokenized_datasets
Dataset({
    features: ['attention_mask', 'input_ids', 'labels', 'token_type_ids'],
    num_rows: 755988
})
>>> tokenized_datasets[0].keys()
dict_keys(['attention_mask', 'input_ids', 'labels', 'token_type_ids'])

But when I create a Trainer object, the labels key just disappears!

>>> training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total # of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
)
>>> trainer = Trainer(
       model=model,                         # the instantiated šŸ¤— Transformers model to be trained
       args=training_args,                  # training arguments, defined above
       train_dataset=tokenized_datasets         # training dataset
)
>>> tokenized_datasets[0].keys()                             # WHAT???
dict_keys(['attention_mask', 'input_ids', 'token_type_ids'])

This causes training to fail like so:

>>> trainer.train()
KeyError                                  Traceback (most recent call last)
<ipython-input-108-21d21c7948cc> in <module>()
----> 1 trainer.train()

[SNIPPED]    

/home/sbendl/.local/lib/python3.6/site-packages/transformers/file_utils.py in __getitem__(self, k)
   1414         if isinstance(k, str):
   1415             inner_dict = {k: v for (k, v) in self.items()}
-> 1416             return inner_dict[k]
   1417         else:
   1418             return self.to_tuple()[k]

KeyError: 'loss'

I'm at a loss here, and quite frustrated. Why on earth is this happening? It doesn't happen when I follow the (very similar) code here. All the code snippets are run sequentially in my notebook; there's no "hidden" code. I have a dataset, I pass it to the trainer, and as a result my dataset is broken.

How did you create your model? If the key is dropped by the Trainer, it means the model signature does not accept it. You can also deactivate that behavior by passing remove_unused_columns=False in your TrainingArguments.
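
For reference, a minimal sketch of that flag in place (same fields as the TrainingArguments above; the extra argument is the only change, and this assumes the rest of your setup stays as posted):

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    remove_unused_columns=False,   # keep columns even if the model's forward() doesn't list them
)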

Thank you for that tip… Seems crazy to me that that option is enabled by default… That was, of course, the reason it was being deleted. As for why it's not working, I'm still trying to figure that out. My model is created like:

from transformers import BertLMHeadModel, Trainer, TrainingArguments, BertTokenizer
model = BertLMHeadModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', truncation=True, padding=True)

Later, my dataset is created like this:

import copy

import pandas as pd
from datasets import Dataset

dataset = Dataset.from_pandas(pd.DataFrame(model_inputs, columns=['text']))

def tokenize_function(examples):
    return tokenizer(examples["text"], padding='max_length', max_length=90, add_special_tokens=True)

def make_labels(examples):
    result = {k: v for k, v in examples.items()}
    result['labels'] = copy.copy(result['text'])
    return result

labeled_dataset = dataset.map(make_labels, batched=True)
tokenized_datasets = labeled_dataset.map(tokenize_function, batched=True, num_proc=64, remove_columns=["text"])

Never mind the previous post. The problem was that, in my confusion about why the labels were disappearing, I switched make_labels and tokenize_function around, and because of that I was passing raw text into the model instead of the tokenized labels :slight_smile:
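
For completeness, a sketch of the corrected order (tokenize first, then derive labels from input_ids; this isn't my exact code, and the add_labels helper name is just illustrative):

# Tokenize first so the label column is built from token IDs, not raw text
tokenized = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

def add_labels(examples):
    # With batched=True, input_ids is a list of lists; copy each so labels and inputs stay independent
    examples['labels'] = [ids.copy() for ids in examples['input_ids']]
    return examples

tokenized_datasets = tokenized.map(add_labels, batched=True)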
