KeyError: 'loss' when fine-tuning a Transformer model

I am trying to fine-tune a transformer model on my own unlabeled text corpus. Here is my code:

from datasets import load_dataset
from transformers import BertTokenizerFast
from transformers import AutoModel
from transformers import TrainingArguments
from transformers import Trainer
import glob
import os


base_path = '../data/'
model_name = 'bert-base-uncased'
max_length = 512
checkpoints_dir = 'checkpoints'

if not os.path.exists(checkpoints_dir):
    os.mkdir(checkpoints_dir)

tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)


def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=max_length)


dataset = load_dataset('text',
        data_files={
            'train': f'{base_path}train.txt',
            'test': f'{base_path}test.txt',
            'validation': f'{base_path}valid.txt'
        }
)

print('Tokenizing data. This may take a while...')
tokenized_dataset = dataset.map(tokenize_function, batched=True)
train_dataset = tokenized_dataset['train']
eval_dataset = tokenized_dataset['test']

model = AutoModel.from_pretrained(model_name)

training_args = TrainingArguments(checkpoints_dir)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()

However, trainer.train() raises KeyError: 'loss'. How do I fix this?

Hello!

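Make sure your dataset has a "labels" column; without it, the Trainer has no labels to pass to the model, so no loss is calculated. Also note that AutoModel loads the bare encoder without any task head, and a headless model does not return a loss at all.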
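
To see where the error comes from, here is a minimal sketch comparing a headless BERT with one that has a masked-LM head. It assumes a tiny, randomly initialised BertConfig (the sizes are made up for illustration, so nothing has to be downloaded) and random token ids in place of real text. It is not your training setup, only a way to inspect the model outputs.

```python
import torch
from transformers import BertConfig, BertForMaskedLM, BertModel

# Hypothetical tiny config, chosen only so the example runs without downloads.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)

# Random token ids standing in for a tokenized batch (2 sequences, 8 tokens).
input_ids = torch.randint(5, 100, (2, 8))
labels = input_ids.clone()

# Bare encoder: its forward() accepts no labels and returns no loss.
base_out = BertModel(config)(input_ids=input_ids)
base_has_loss = "loss" in base_out.keys()

# Model with an MLM head: given labels, it returns a scalar loss.
mlm_out = BertForMaskedLM(config)(input_ids=input_ids, labels=labels)

print("base model returns a loss:", base_has_loss)
print("MLM model loss shape:", tuple(mlm_out.loss.shape))
```

The Trainer looks for a loss in the model outputs, which is why the headless model fails at trainer.train().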

Thanks for the swift reply, @beneyal! I don’t have labels in my data, since my end goal is to simply fine-tune the weights and use it for multiple tasks (but primarily for generating embeddings). How can I achieve that?

No problem, you’re very welcome!

Even if you just want to fine-tune the embedding weights, you need a training objective with labels so that a loss can be calculated and the gradients backpropagated. If you don’t have a labelled dataset for a specific downstream task, you can use a self-supervised objective such as Masked Language Modeling (MLM) or Causal Language Modeling (CLM), where the labels are derived from the text itself.

Thanks a lot! Using the MLM tutorials helped me fix this!
