How to compute metrics when ground truth and predictions have different data?

I am evaluating different layout models (like LayoutLM) to perform entity extraction from a dataset of scanned forms. Phrases detected in these forms are classified as keys and values.

I have annotated a dataset by marking the relevant phrases and their labels (key or value).

The candidate models use OCR to extract text, and the extracted text is then labeled by the model (using BERT-style BIO tags to identify phrases). So each model may come up with a different number of phrases, and with different words in those phrases, than the annotated ground truth.
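
For concreteness, a made-up token-level output for one form might look like this, with contiguous B-/I- tags grouped back into phrases (the tag names and tokens here are just illustrative):

    # hypothetical model output: OCR tokens with BIO tags
    tokens = ["Invoice", "Number", ":", "INV-1234", "Date", ":", "01/02/2023"]
    tags   = ["B-KEY",   "I-KEY",  "O", "B-VALUE",  "B-KEY", "O", "B-VALUE"]
    # grouping gives: ("Invoice Number", key), ("INV-1234", value),
    #                 ("Date", key), ("01/02/2023", value)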

How do I compute metrics such as precision, recall and F1 score to compare these models? Is there a standardized approach for doing this?

As a first approximation, I am considering doing the following:

  • Split phrases into words, remove punctuation, and lowercase all words
  • Apply the phrase label to each word, so the ground truth and predictions each become a list of (word, label) pairs. Note that the same word can occur more than once in a form
  • To align the two, add entries for words that are in the ground truth but missing from the predictions, and vice versa. Label all such added entries with a special “other” class: ‘ZZZ’
  • Sort the ground truth and predictions alphabetically
  • Use the aligned labels from ground truth and predictions to compute the metrics as usual (a rough sketch of this is shown below)
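
Something like this (a rough sketch only; the helper names and the example phrases are made up, and I would use scikit-learn's precision_recall_fscore_support for the final numbers):

    from collections import Counter
    import string

    from sklearn.metrics import precision_recall_fscore_support

    PAD_LABEL = "ZZZ"  # special class for unmatched words

    def normalize(word):
        """Lowercase and strip punctuation."""
        return word.lower().translate(str.maketrans("", "", string.punctuation))

    def to_word_labels(phrases):
        """Flatten [(phrase, label), ...] into normalized (word, label) pairs."""
        pairs = []
        for phrase, label in phrases:
            for word in phrase.split():
                w = normalize(word)
                if w:
                    pairs.append((w, label))
        return pairs

    def align(gt_pairs, pred_pairs):
        """Pad each side with (word, 'ZZZ') until the word multisets match,
        then sort alphabetically and return the two label sequences."""
        gt, pred = list(gt_pairs), list(pred_pairs)
        gt_counts = Counter(w for w, _ in gt)
        pred_counts = Counter(w for w, _ in pred)
        for word, n in (gt_counts - pred_counts).items():  # words missed by the model
            pred.extend([(word, PAD_LABEL)] * n)
        for word, n in (pred_counts - gt_counts).items():  # words not in the annotation
            gt.extend([(word, PAD_LABEL)] * n)
        # note: duplicate words with different labels pair up arbitrarily here
        gt.sort()
        pred.sort()
        return [label for _, label in gt], [label for _, label in pred]

    # made-up annotations and predictions for one form
    gt_phrases = [("Invoice Number:", "key"), ("INV-1234", "value")]
    pred_phrases = [("Invoice Number", "key"), ("1234", "value"), ("Total", "key")]

    y_true, y_pred = align(to_word_labels(gt_phrases), to_word_labels(pred_phrases))
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=["key", "value"], average="micro"
    )
    print(p, r, f1)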

Is this a reasonable approach? Is there anything better possible?

Some other things we could try (not sure if they are necessarily better):

  • Ideally, we would also use the layout information to check that words/phrases with a given label fall in the same region. But aligning bounding boxes exactly between annotation and inference can get tricky (see the IoU sketch after this list).
  • We could also label whole phrases directly instead of splitting them into words
  • We could use the OCR output directly without correcting its text, but each model uses a different OCR engine, so we would have to work with each engine's output; in that case we would use the bounding box as the alignment feature and would need to match the boxes properly.
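
For the layout-based option, one way I am considering to avoid exact box alignment is to match predicted and annotated boxes when their intersection-over-union (IoU) exceeds a threshold, similar to how object-detection evaluations work. A rough sketch (the (x0, y0, x1, y1) box format, the 0.5 threshold, and the greedy matching are my own assumptions):

    def iou(box_a, box_b):
        """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
        x0 = max(box_a[0], box_b[0])
        y0 = max(box_a[1], box_b[1])
        x1 = min(box_a[2], box_b[2])
        y1 = min(box_a[3], box_b[3])
        inter = max(0, x1 - x0) * max(0, y1 - y0)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union else 0.0

    def match_boxes(gt, pred, threshold=0.5):
        """Greedily match each predicted (box, label) to the best unmatched
        ground-truth (box, label); count it as a true positive only when the
        IoU clears the threshold and the labels agree."""
        unmatched_gt = list(range(len(gt)))
        tp = fp = 0
        for p_box, p_label in pred:
            best_iou, best_idx = 0.0, None
            for i in unmatched_gt:
                g_box, _ = gt[i]
                score = iou(p_box, g_box)
                if score > best_iou:
                    best_iou, best_idx = score, i
            if best_idx is not None and best_iou >= threshold and gt[best_idx][1] == p_label:
                tp += 1
                unmatched_gt.remove(best_idx)
            else:
                fp += 1
        fn = len(unmatched_gt)
        return tp, fp, fn

    # made-up usage: boxes in pixel coordinates, labels "key"/"value"
    gt = [((10, 10, 120, 30), "key"), ((130, 10, 220, 30), "value")]
    pred = [((12, 9, 118, 31), "key"), ((300, 200, 350, 220), "value")]
    tp, fp, fn = match_boxes(gt, pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(precision, recall)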