How to compute metrics when ground truth and predictions have different data?

I am evaluating different layout models (like LayoutLM) to perform entity extraction from a dataset of scanned forms. Phrases detected in these forms are classified as keys and values.

I have annotated a dataset by marking the relevant phrases and their labels (key or value).

The candidate models use OCR to extract text, and the extracted text is then labeled by the model (using BERT-style BIO tags to identify phrases). So each model may come up with a different number of phrases, and with different words in those phrases, than the annotated ground truth.
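
For concreteness, a made-up token-level output for one form might look like this, with contiguous B-/I- tags grouped back into phrases (the tag names and tokens here are just illustrative):

    # hypothetical model output: OCR tokens with BIO tags
    tokens = ["Invoice", "Number", ":", "INV-1234", "Date", ":", "01/02/2023"]
    tags   = ["B-KEY",   "I-KEY",  "O", "B-VALUE",  "B-KEY", "O", "B-VALUE"]
    # grouping gives: ("Invoice Number", key), ("INV-1234", value),
    #                 ("Date", key), ("01/02/2023", value)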

How do I compute metrics such as precision, recall and F1 score to compare these models? Is there a standardized approach for doing this?

As a first approximation, I am considering doing the following:

  • Split phrases into words, remove punctuation, and lowercase all words
  • Apply the phrase label to each word, so the ground truth and predictions each become a list of (word, label) pairs. Note that the same word can occur more than once in a form
  • To align the two, add entries for words that are in the ground truth but missing from the predictions, and vice versa. Label all such added entries with a special “other” class: ‘ZZZ’
  • Sort the ground truth and predictions alphabetically
  • Use the aligned labels from ground truth and predictions to compute the metrics as usual (a rough sketch of this is shown below)
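
Something like this (a rough sketch only; the helper names and the example phrases are made up, and I would use scikit-learn's precision_recall_fscore_support for the final numbers):

    from collections import Counter
    import string

    from sklearn.metrics import precision_recall_fscore_support

    PAD_LABEL = "ZZZ"  # special class for unmatched words

    def normalize(word):
        """Lowercase and strip punctuation."""
        return word.lower().translate(str.maketrans("", "", string.punctuation))

    def to_word_labels(phrases):
        """Flatten [(phrase, label), ...] into normalized (word, label) pairs."""
        pairs = []
        for phrase, label in phrases:
            for word in phrase.split():
                w = normalize(word)
                if w:
                    pairs.append((w, label))
        return pairs

    def align(gt_pairs, pred_pairs):
        """Pad each side with (word, 'ZZZ') until the word multisets match,
        then sort alphabetically and return the two label sequences."""
        gt, pred = list(gt_pairs), list(pred_pairs)
        gt_counts = Counter(w for w, _ in gt)
        pred_counts = Counter(w for w, _ in pred)
        for word, n in (gt_counts - pred_counts).items():  # words missed by the model
            pred.extend([(word, PAD_LABEL)] * n)
        for word, n in (pred_counts - gt_counts).items():  # words not in the annotation
            gt.extend([(word, PAD_LABEL)] * n)
        # note: duplicate words with different labels pair up arbitrarily here
        gt.sort()
        pred.sort()
        return [label for _, label in gt], [label for _, label in pred]

    # made-up annotations and predictions for one form
    gt_phrases = [("Invoice Number:", "key"), ("INV-1234", "value")]
    pred_phrases = [("Invoice Number", "key"), ("1234", "value"), ("Total", "key")]

    y_true, y_pred = align(to_word_labels(gt_phrases), to_word_labels(pred_phrases))
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=["key", "value"], average="micro"
    )
    print(p, r, f1)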

Is this a reasonable approach? Is there anything better possible?

Some other things we could try (not sure if they are necessarily better):

  • Ideally, we would also use the layout information to check that words/phrases with a given label fall in the same region. But aligning bounding boxes exactly between annotation and inference can get tricky (see the IoU sketch after this list).
  • We could also label whole phrases directly instead of splitting them into words
  • We could use the OCR output directly without correcting its text, but each model uses a different OCR engine, so we would have to work with each engine's output; in that case we would use the bounding box as the alignment feature and would need to match the boxes properly.
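
For the layout-based option, one way I am considering to avoid exact box alignment is to match predicted and annotated boxes when their intersection-over-union (IoU) exceeds a threshold, similar to how object-detection evaluations work. A rough sketch (the (x0, y0, x1, y1) box format, the 0.5 threshold, and the greedy matching are my own assumptions):

    def iou(box_a, box_b):
        """Intersection-over-union of two axis-aligned boxes (x0, y0, x1, y1)."""
        x0 = max(box_a[0], box_b[0])
        y0 = max(box_a[1], box_b[1])
        x1 = min(box_a[2], box_b[2])
        y1 = min(box_a[3], box_b[3])
        inter = max(0, x1 - x0) * max(0, y1 - y0)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union else 0.0

    def match_boxes(gt, pred, threshold=0.5):
        """Greedily match each predicted (box, label) to the best unmatched
        ground-truth (box, label); count it as a true positive only when the
        IoU clears the threshold and the labels agree."""
        unmatched_gt = list(range(len(gt)))
        tp = fp = 0
        for p_box, p_label in pred:
            best_iou, best_idx = 0.0, None
            for i in unmatched_gt:
                g_box, _ = gt[i]
                score = iou(p_box, g_box)
                if score > best_iou:
                    best_iou, best_idx = score, i
            if best_idx is not None and best_iou >= threshold and gt[best_idx][1] == p_label:
                tp += 1
                unmatched_gt.remove(best_idx)
            else:
                fp += 1
        fn = len(unmatched_gt)
        return tp, fp, fn

    # made-up usage: boxes in pixel coordinates, labels "key"/"value"
    gt = [((10, 10, 120, 30), "key"), ((130, 10, 220, 30), "value")]
    pred = [((12, 9, 118, 31), "key"), ((300, 200, 350, 220), "value")]
    tp, fp, fn = match_boxes(gt, pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(precision, recall)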