Yes, you’re right that the problem is happening on the `trainer.evaluate()` step. It might be coming from the `label_names` argument in your `TrainingArguments`. From the docs we have:
> The list of keys in your dictionary of inputs that correspond to the labels. Will eventually default to `["labels"]` except if the model used is one of the `XxxForQuestionAnswering`, in which case it will default to `["start_positions", "end_positions"]`.
So it seems you need to provide a list like `['label']` instead of the string. If that doesn’t work, you could try renaming the “label” column in your CSV files to “labels” and then dropping the `label_names` argument from `TrainingArguments`.
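For example, something along these lines (just a sketch; I’m assuming your label column is literally called "label" and reusing the `output_dir` from your snippet, and `dataset` stands for the `datasets.Dataset` you built from the CSVs):

```python
from transformers import TrainingArguments

# Option 1: pass label_names as a list, not a string
training_args = TrainingArguments(
    output_dir="test_20210201_1200",
    label_names=["label"],
)

# Option 2: rename the column so the default ["labels"] applies and
# drop label_names entirely; `dataset` is assumed to be the
# datasets.Dataset you loaded from your CSV files
dataset = dataset.rename_column("label", "labels")
```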
You can then check if it works by just running `trainer.evaluate()`, which is faster than waiting for one epoch of training.
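For instance (assuming `trainer` is the `Trainer` instance you already built):

```python
# Runs only the evaluation loop, so it fails fast if the labels
# are not being picked up correctly
metrics = trainer.evaluate()
print(metrics)
```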
As a tip, I would also specify all the implicit arguments of your `TrainingArguments` and `Trainer` explicitly, e.g. use `output_dir="test_20210201_1200"` in `TrainingArguments`, and similarly for `model` and `args` in `Trainer`.
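Concretely, something like this sketch, where `model`, `training_args`, the tokenized datasets and `compute_metrics` are placeholders for the objects you already have in your script:

```python
from transformers import Trainer

# Everything passed by keyword rather than positionally
trainer = Trainer(
    model=model,                      # your model instance
    args=training_args,               # your TrainingArguments
    train_dataset=train_dataset,      # your tokenized training split
    eval_dataset=eval_dataset,        # your tokenized validation split
    compute_metrics=compute_metrics,  # your metrics function
)
```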
PS. One thing that looks a bit odd is the way you load the metric:

`metric = load_metric('f1', 'accuracy')`

I don’t think you can load multiple metrics this way, since the second argument refers to the “configuration” of the metric (e.g. GLUE has a config for each task). Nevertheless, this is probably not the source of the problem.
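If you want both metrics, one way (a sketch, assuming a standard single-label classification setup; for a multi-class problem you would also pass an `average` argument to the F1 metric) is to load them separately and combine them in your `compute_metrics` function:

```python
import numpy as np
from datasets import load_metric

# Load each metric on its own; the second positional argument of
# load_metric selects a configuration of a metric (e.g. a GLUE task),
# not an additional metric
f1_metric = load_metric("f1")
accuracy_metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"],
        "f1": f1_metric.compute(predictions=predictions, references=labels)["f1"],
    }
```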