KeyError: 'loss' even though my dataset has labels

Hi everyone! I’m trying to fine-tune the Musixmatch/umberto-commoncrawl-cased-v1 model on a NER task, using the Italian section of the wikiann dataset. The notebook I’m following is this one: notebooks/token_classification.ipynb at master · huggingface/notebooks · GitHub.
The dataset’s initial structure is:

DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 20000
    })
})

It has no labels yet, but the DataCollatorForTokenClassification should help me out by generating them.

from transformers import DataCollatorForTokenClassification
from datasets import load_metric

data_collator = DataCollatorForTokenClassification(tokenizer)
metric = load_metric("seqeval")
from transformers import AutoModel, TrainingArguments, Trainer

model = AutoModel.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")
model_name = model_checkpoint.split("/")[-1]

training_args = TrainingArguments(
    f"{model_name}-finetuned-{task}",        # output directory
    evaluation_strategy="epoch",
    num_train_epochs=3,                      # total number of training epochs
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,  # batch size per device during training
    per_device_eval_batch_size=batch_size,   # batch size for evaluation
    warmup_steps=500,                        # number of warmup steps for learning rate scheduler
    weight_decay=0.01,                       # strength of weight decay
    logging_dir='./logs',                    # directory for storing logs
    logging_steps=10,
)

trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

The error it raises when I run trainer.train() is:

KeyError                                  Traceback (most recent call last)
<ipython-input-16-3435b262f1ae> in <module>()
----> 1 trainer.train()

3 frames
/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py in __getitem__(self, k)
   2041         if isinstance(k, str):
   2042             inner_dict = {k: v for (k, v) in self.items()}
-> 2043             return inner_dict[k]
   2044         else:
   2045             return self.to_tuple()[k]

KeyError: 'loss'

How can I fix it? What am I doing wrong? Thanks for the help!

No, you need to preprocess your dataset to generate the labels. The data collator is only there to pad those labels as well as the inputs. Have a look at the token classification example script or the example notebook.
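Roughly, the label-generation step from that notebook looks like this (just a sketch, not your exact code: I’m assuming your raw DatasetDict is named dataset and that tokenizer is the fast tokenizer you already loaded; tokenize_and_align_labels and label_all_tokens are the notebook’s names):

label_all_tokens = True

def tokenize_and_align_labels(examples):
    # Tokenize the pre-split words and keep track of which word each sub-token comes from.
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens get -100 so the loss ignores them.
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First sub-token of a word gets the word's label.
                label_ids.append(label[word_idx])
            else:
                # Other sub-tokens: repeat the label or ignore them, depending on the flag.
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)

After that map call, tokenized_dataset["train"] carries a labels column, which is what the Trainer needs to compute the loss; the collator then only pads those labels along with the inputs.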
