Hello, HuggingFace community. I’ve got unstructured lab reports that contain the value of each test result. For example, this is a report containing the test results for magnesium (MAGNESIO, 2,0), potassium (POTASSIO, 4,9) and sodium (SODIO, 137):
MAGNESIO\nMaterial de Coleta: SORO\nMétodo: Clorofosfonazo IlI\nReferência\nResultado:\n2,0\nmg/dL\n1 1,7 a 2,5 mg/dL\nPOTASSIO\nMaterial de Coleta: SORO\nMétodo: Eletrodo Seletivo\nReferência\nResultado\n4,9\nmEq/L\n1 3,5 a 5,1 mEq/L\nSODIO\nMaterial de Coleta: SORO\nMétodo: Eletrodo Seletivo\nReferência\nResultado\n137\nmEq/L\n/ 135 a 145 mEq/L\n
(Test name and result annotated for ease of reading)
I would like to use a BERT-like model to extract this information into a structure like this:
{
"magnesium": "2,0",
"potassium": "4,9",
"sodium": "137"
}
Since my inputs are in the Portuguese language, I figured BERTimbau would be a good foundational model. Is using BERT the appropriate way to solve my problem? How would I go about annotating my training data and setting up my model for training?
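To make the annotation question concrete, here is a minimal sketch of one possible framing: token classification with one label per whitespace-separated word. Everything in it beyond the report text and the desired output is an assumption made for illustration: the label names (`B-TEST`, `B-VALUE`, `O`), the helper functions `annotate` and `decode`, and the `KEYS` map from Portuguese test names to English keys are hypothetical, not an established scheme for BERTimbau.

```python
# Hypothetical labelling scheme (an assumption, not a standard):
#   B-TEST  marks a test name, B-VALUE marks its result, O marks everything else.
REPORT = (
    "MAGNESIO\nMaterial de Coleta: SORO\nMétodo: Clorofosfonazo IlI\n"
    "Referência\nResultado:\n2,0\nmg/dL\n1 1,7 a 2,5 mg/dL\n"
    "POTASSIO\nMaterial de Coleta: SORO\nMétodo: Eletrodo Seletivo\n"
    "Referência\nResultado\n4,9\nmEq/L\n1 3,5 a 5,1 mEq/L\n"
    "SODIO\nMaterial de Coleta: SORO\nMétodo: Eletrodo Seletivo\n"
    "Referência\nResultado\n137\nmEq/L\n/ 135 a 145 mEq/L\n"
)

# Gold values for this report, as a human annotator would record them.
GOLD = {"MAGNESIO": "2,0", "POTASSIO": "4,9", "SODIO": "137"}

# Hypothetical map from the names in the report to the desired output keys.
KEYS = {"MAGNESIO": "magnesium", "POTASSIO": "potassium", "SODIO": "sodium"}


def annotate(report, gold):
    """Turn a report plus its gold values into (tokens, labels)."""
    tokens = report.split()  # split on any whitespace, including newlines
    labels = ["O"] * len(tokens)
    cursor = 0
    for name, value in gold.items():
        # Find the test name, then the first matching value after it.
        name_idx = tokens.index(name, cursor)
        value_idx = tokens.index(value, name_idx + 1)
        labels[name_idx] = "B-TEST"
        labels[value_idx] = "B-VALUE"
        cursor = value_idx + 1
    return tokens, labels


def decode(tokens, labels, keys):
    """Rebuild the output dictionary from per-token labels."""
    result = {}
    current = None
    for token, label in zip(tokens, labels):
        if label == "B-TEST":
            current = token
        elif label == "B-VALUE" and current is not None:
            result[keys.get(current, current)] = token
            current = None
    return result


tokens, labels = annotate(REPORT, GOLD)
print(decode(tokens, labels, KEYS))
```

Under this framing, `decode` would be applied to the model's predicted labels rather than the gold ones. The labels here are per word, so they would still need to be aligned to the subword tokens a BERT tokenizer produces before training.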