Hi everyone! I’m trying to fine-tune the Musixmatch/umberto-commoncrawl-cased-v1 model on a NER task,
using the Italian section of the wikiann dataset. The notebook I’m following is this one: notebooks/token_classification.ipynb at master · huggingface/notebooks · GitHub.
The dataset’s initial structure is:
DatasetDict({
    validation: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 10000
    })
    test: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 10000
    })
    train: Dataset({
        features: ['tokens', 'ner_tags', 'langs', 'spans'],
        num_rows: 20000
    })
})
It has no labels column, but I expected DataCollatorForTokenClassification to help me out by generating one.
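For context, the linked notebook builds the labels column during tokenization, before the collator runs; as far as I know, the collator only pads a labels column that already exists. Below is a minimal sketch of that alignment step, assuming a fast tokenizer whose `word_ids()` output maps each sub-token to its source word (None for special tokens). The function name `align_labels_with_tokens` and the sample values are hypothetical, chosen only for illustration.

```python
def align_labels_with_tokens(word_ids, word_labels, label_all_tokens=True):
    """Spread word-level NER tags over sub-tokens.

    word_ids: one entry per sub-token, giving the index of the word it
              came from, or None for special tokens.
    word_labels: one integer tag per original word (e.g. the ner_tags column).
    """
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens get -100 so the loss ignores them.
            aligned.append(-100)
        elif word_id != previous_word_id:
            # First sub-token of a word keeps the word's tag.
            aligned.append(word_labels[word_id])
        else:
            # Later sub-tokens either repeat the tag or are ignored.
            aligned.append(word_labels[word_id] if label_all_tokens else -100)
        previous_word_id = word_id
    return aligned


# Hypothetical sentence of 3 words; the second word is split into 2 sub-tokens.
word_ids = [None, 0, 1, 1, 2, None]
ner_tags = [0, 5, 0]
labels = align_labels_with_tokens(word_ids, ner_tags)
```

In the notebook this logic runs inside a function passed to `datasets.map`, which adds a labels column to each split before training.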
from transformers import DataCollatorForTokenClassification
from datasets import load_metric

data_collator = DataCollatorForTokenClassification(tokenizer)
metric = load_metric("seqeval")

from transformers import AutoModel, TrainingArguments, Trainer

model = AutoModel.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")
model_name = model_checkpoint.split("/")[-1]
training_args = TrainingArguments(
    f"{model_name}-finetuned-{task}",  # output directory
    evaluation_strategy="epoch",
    num_train_epochs=3,  # total number of training epochs
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,  # batch size per device during training
    per_device_eval_batch_size=batch_size,  # batch size for evaluation
    warmup_steps=500,  # number of warmup steps for learning rate scheduler
    weight_decay=0.01,  # strength of weight decay
    logging_dir='./logs',  # directory for storing logs
    logging_steps=10,
)
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
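The `compute_metrics` function passed to the Trainer is not shown here. In the linked notebook it drops the positions labelled -100 and converts class ids to tag strings before scoring with seqeval. A minimal sketch of just that filtering step follows, assuming the predictions have already been reduced to class ids; the helper name `strip_ignored_positions` and the shortened label list are hypothetical.

```python
def strip_ignored_positions(predictions, labels, label_list, ignore_index=-100):
    """Drop positions whose gold label is ignore_index and map ids to tag names."""
    true_predictions, true_labels = [], []
    for pred_row, label_row in zip(predictions, labels):
        # Keep only (prediction, label) pairs that take part in the loss.
        kept = [(p, l) for p, l in zip(pred_row, label_row) if l != ignore_index]
        true_predictions.append([label_list[p] for p, _ in kept])
        true_labels.append([label_list[l] for _, l in kept])
    return true_predictions, true_labels


# Hypothetical, shortened tag set and a single 4-token sequence.
label_list = ["O", "B-PER", "I-PER"]
predictions = [[0, 1, 2, 0]]
labels = [[-100, 1, 2, -100]]
true_predictions, true_labels = strip_ignored_positions(predictions, labels, label_list)
```

The two returned lists of tag strings are in the shape that seqeval-style metrics expect: one list of tags per sequence.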
The error it raises when I run trainer.train() is:
KeyError                                  Traceback (most recent call last)
<ipython-input-16-3435b262f1ae> in <module>()
----> 1 trainer.train()

3 frames
/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py in __getitem__(self, k)
   2041         if isinstance(k, str):
   2042             inner_dict = {k: v for (k, v) in self.items()}
-> 2043             return inner_dict[k]
   2044         else:
   2045             return self.to_tuple()[k]

KeyError: 'loss'
How can I fix it? What am I doing wrong? Thanks for the help!