I’m using the bert-base-german-cased model to perform token classification with custom NER labels on a dataset of German court documents. I have 11 labels in total (including the O label), which are not tagged in BIO form. I train and evaluate on an NVIDIA GeForce GTX Titan X.
Despite the good resources and a model that was pretrained on German judicial documents, the results are rather lacking:
                                     precision   recall   f1-score   support
Date                                      0.87     0.99       0.93       407
Schadensbetrag                            0.77     0.58       0.66       112
Delikt                                    0.59     0.50       0.54        44
Gestaendnis_ja                            0.60     0.71       0.65        21
Vorstrafe_nein                            0.00     0.00       0.00         6
Strafe_Gesamtfreiheitsstrafe_Dauer        0.76     0.91       0.83        35
Strafe_Gesamtsatz_Betrag                  0.42     0.52       0.46        25
Strafe_Gesamtsatz_Dauer                   0.52     0.82       0.64        28
Strafe_Tatbestand                         0.30     0.29       0.30       283

micro avg                                 0.65     0.68       0.66       961
macro avg                                 0.54     0.59       0.56       961
weighted avg                              0.64     0.68       0.66       961
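One thing that stands out in the support column is the class imbalance (6 examples of Vorstrafe_nein vs. 407 of Date). A common remedy is to weight the loss per class. Here is a minimal stdlib sketch of inverse-frequency class weights, using the support counts from the report above (the `class_weights` helper is my own illustration, not part of any library):

```python
# Support values taken from the evaluation report above.
support = {
    "Date": 407,
    "Schadensbetrag": 112,
    "Delikt": 44,
    "Gestaendnis_ja": 21,
    "Vorstrafe_nein": 6,
    "Strafe_Gesamtfreiheitsstrafe_Dauer": 35,
    "Strafe_Gesamtsatz_Betrag": 25,
    "Strafe_Gesamtsatz_Dauer": 28,
    "Strafe_Tatbestand": 283,
}

def class_weights(counts):
    """Weight each class by inverse frequency, rescaled so the
    weights average to 1 (keeps the overall loss magnitude stable)."""
    inv = {label: 1.0 / n for label, n in counts.items()}
    scale = len(inv) / sum(inv.values())
    return {label: w * scale for label, w in inv.items()}

weights = class_weights(support)
# Rare labels get the largest weights, e.g.
# weights["Vorstrafe_nein"] > weights["Date"]
```

These weights could then be passed, as a tensor in label-id order, to `torch.nn.CrossEntropyLoss(weight=...)` inside a custom `Trainer` subclass that overrides `compute_loss`; whether that helps more than simply collecting additional examples for the rare labels is something you would have to test.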
What could be some steps to improve these results?
Perhaps it’s the low example count for some of the labels, or the fact that the entities are often not single tokens but spans of multiple tokens?
I’d be grateful for any hints from more experienced users. I can also share data or other files if they are relevant.
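On the multi-token-span point: without BIO tagging the model has no way to mark span boundaries, and two adjacent entities of the same type collapse into one. If your data is stored as one flat label per token, converting it to BIO is mechanical. A minimal sketch (assuming a per-token label list as input; note that flat labels cannot distinguish two directly adjacent entities of the same type, so those stay merged):

```python
def to_bio(tags):
    """Convert flat per-token tags to BIO: the first token of each
    contiguous run of one label gets B-, the rest of the run gets I-."""
    bio = []
    prev = "O"
    for tag in tags:
        if tag == "O":
            bio.append("O")
        elif tag == prev:
            bio.append("I-" + tag)   # continuation of the current span
        else:
            bio.append("B-" + tag)   # start of a new span
        prev = tag
    return bio

print(to_bio(["O", "Delikt", "Delikt", "Date", "O"]))
# → ['O', 'B-Delikt', 'I-Delikt', 'B-Date', 'O']
```

This doubles the non-O label count (B-/I- per type), but in my experience that is usually a worthwhile trade for cleaner span boundaries, and it also lets you evaluate with span-level metrics (e.g. seqeval) instead of per-token ones.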
This is my config file:
{
    "data_dir": "./Data",
    "labels": "./Data/labels.txt",
    "model_name_or_path": "bert-base-german-cased",
    "output_dir": "./Data/Models",
    "task_type": "NER",
    "max_seq_length": 180,
    "num_train_epochs": 6,
    "per_device_train_batch_size": 48,
    "seed": 7,
    "fp16": true,
    "do_train": true,
    "do_predict": true,
    "do_eval": true
}
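One more thing worth checking with a max_seq_length of 180: court documents tend to be long, and anything past the limit is silently truncated, so entities late in a document never reach the model. A stdlib sketch of splitting a token sequence into overlapping windows so that no token is lost (the window and stride values here are illustrative, not recommendations):

```python
def sliding_windows(tokens, max_len=180, stride=30):
    """Split a token list into windows of at most max_len tokens,
    with consecutive windows overlapping by `stride` tokens so an
    entity cut at one boundary appears whole in the next window."""
    if len(tokens) <= max_len:
        return [tokens]
    windows = []
    step = max_len - stride
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # last window already reaches the end
    return windows

# A 400-token document becomes three overlapping windows
chunks = sliding_windows(list(range(400)), max_len=180, stride=30)
```

If you tokenize with a Hugging Face fast tokenizer, you can get the same effect directly via `return_overflowing_tokens=True` together with the `stride` argument, which also keeps the offset mappings you need to re-align the labels per window.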