Improving performance results for BERT

I’m using the bert-base-german-cased model to perform token classification with custom NER labels on a dataset of German court documents. I have 11 labels in total (including the O label); however, they are not tagged in BIO format. I’m training and evaluating the model on an NVIDIA GeForce GTX Titan X.

But despite the good resources and a model that was actually pretrained on German judicial documents, the results are rather lacking.

                                    precision    recall  f1-score   support

                              Date       0.87      0.99      0.93       407
                   Schadensbetrag       0.77      0.58      0.66       112
                            Delikt       0.59      0.50      0.54        44
                    Gestaendnis_ja       0.60      0.71      0.65        21
                    Vorstrafe_nein       0.00      0.00      0.00         6
Strafe_Gesamtfreiheitsstrafe_Dauer       0.76      0.91      0.83        35
          Strafe_Gesamtsatz_Betrag       0.42      0.52      0.46        25
           Strafe_Gesamtsatz_Dauer       0.52      0.82      0.64        28
                 Strafe_Tatbestand       0.30      0.29      0.30       283

                        micro avg       0.65      0.68      0.66       961
                        macro avg       0.54      0.59      0.56       961
                     weighted avg       0.64      0.68      0.66       961


What steps could I take to improve these results?
Perhaps the problem is the low number of examples for some of the labels, or that the entities are often not single tokens but spans of multiple tokens?
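
If you want to experiment with BIO tagging, which often helps a model learn span boundaries for multi-token entities, here is a rough sketch of the conversion. It assumes "O" marks non-entity tokens and that adjacent tokens with the same tag belong to one span (two distinct adjacent entities of the same type cannot be separated without extra span information):

# Rough sketch: convert plain per-token tags to BIO.
# Assumes "O" marks non-entity tokens and that adjacent tokens
# with the same tag belong to one span.
def to_bio(tags):
    bio = []
    prev = "O"
    for tag in tags:
        if tag == "O":
            bio.append("O")
        elif tag == prev:
            bio.append("I-" + tag)
        else:
            bio.append("B-" + tag)
        prev = tag
    return bio

print(to_bio(["O", "Date", "Date", "O", "Delikt"]))
# ['O', 'B-Date', 'I-Date', 'O', 'B-Delikt']

Note that labels.txt would then also need the B-/I- variants of each label.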

I would be glad for any hints from more experienced users. I can also share the data or other files if they are relevant.

This is my config file:

{
    "data_dir": "./Data",
    "labels": "./Data/labels.txt",
    "model_name_or_path": "bert-base-german-cased",
    "output_dir": "./Data/Models",
    "task_type": "NER",
    "max_seq_length": 180,
    "num_train_epochs": 6,
    "per_device_train_batch_size": 48,
    "seed": 7,
    "fp16": true,
    "do_train": true,
    "do_predict": true,
    "do_eval": true
}
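
For reference, a config like this is usually passed straight to the example script (e.g. python run_ner.py config.json), which parses it with HfArgumentParser. Here is a minimal sketch of that mechanism; the ModelArguments and DataArguments dataclasses below are simplified stand-ins for the ones in the actual token-classification example:

from dataclasses import dataclass
from transformers import HfArgumentParser, TrainingArguments

@dataclass
class ModelArguments:
    # Simplified stand-in for the example script's model arguments
    model_name_or_path: str = "bert-base-german-cased"
    task_type: str = "NER"

@dataclass
class DataArguments:
    # Simplified stand-in for the example script's data arguments
    data_dir: str = "./Data"
    labels: str = "./Data/labels.txt"
    max_seq_length: int = 180

parser = HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
# parse_json_file maps each key in the JSON file onto the matching
# dataclass field; unknown keys raise an error.
model_args, data_args, training_args = parser.parse_json_file("config.json")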

Is there anyone who could help with this topic?

As you suggest, I’d start with an exploration of your dataset. See how many examples of each tag/token you have, and check whether rebalancing improves your scores.
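
Something like this quick frequency check could be a starting point. It assumes your training data is in the CoNLL-style one-token-per-line format ("token tag") that the token-classification example expects, with blank lines between sentences; the file name is an assumption:

from collections import Counter

tag_counts = Counter()
with open("./Data/train.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:  # blank lines separate sentences
            continue
        token, tag = line.rsplit(" ", 1)
        tag_counts[tag] += 1

for tag, count in tag_counts.most_common():
    print(f"{tag:>40} {count}")

With only 6 supporting examples, a class like Vorstrafe_nein is unlikely to be learned at all, so oversampling those sentences or collecting more of them would be the first thing to try.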