Hello,
I’m fine-tuning RoBERTa for a token classification task; here is an example of the issue:
When we use any text,
for instance,
text = "Department of Cardiology, University Hospital of Nice, Nice, France."
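Roughly, this is how we run inference (a minimal sketch; the model name below is a placeholder for our fine-tuned checkpoint, and aggregation_strategy="simple" is an assumption about the pipeline setup):

```python
from transformers import pipeline

# Placeholder identifier for our fine-tuned RoBERTa token-classification checkpoint
ner = pipeline(
    "token-classification",
    model="our-finetuned-roberta-ner",
    aggregation_strategy="simple",  # groups B-/I- tokens into entity_group spans
)

text = "Department of Cardiology, University Hospital of Nice, Nice, France."
json_output = ner(text)
print(json_output)
```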
The entities are identified correctly in the json_output, but an extra space appears in front of every entity:
[
  {
    "entity_group": "SUB",
    "score": 0.9959414601325989,
    "word": " Department of Cardiology",
    "start": 0,
    "end": 24
  },
  {
    "entity_group": "ORG",
    "score": 0.9965003728866577,
    "word": " University Hospital of Nice",
    "start": 26,
    "end": 53
  },
  {
    "entity_group": "CITY",
    "score": 0.9671096801757812,
    "word": " Nice",
    "start": 55,
    "end": 59
  },
  {
    "entity_group": "COUNTRY",
    "score": 0.9924795627593994,
    "word": " France",
    "start": 61,
    "end": 67
  }
]
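The start/end offsets themselves look right: slicing the original text with them gives the entities without the leading space, so only the word field seems to be affected. A quick check (continuing from the sketch above):

```python
for ent in json_output:  # json_output = ner(text) from the sketch above
    print(repr(ent["word"]), "->", repr(text[ent["start"]:ent["end"]]))
# ' Department of Cardiology' -> 'Department of Cardiology'
# ' University Hospital of Nice' -> 'University Hospital of Nice'
# ' Nice' -> 'Nice'
# ' France' -> 'France'
```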
Furthermore, in the visual highlighting of the Inference API, when the text starts with an entity, that entity is not highlighted, even though it does appear in the json_output.
Does anyone know where the error comes from? We assume the entity is not highlighted because it is identified with an extra leading space that does not occur in the text. Or could this come from the RoBERTa tokenizer? We wanted to use a RoBERTa model, but when we used BERT this did not happen. How can we avoid this extra space before the recognised entity?
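In case it helps to reproduce, a minimal sketch of what we mean about the tokenizer (using the stock roberta-base tokenizer here for illustration, not our fine-tuned one):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")

# RoBERTa's byte-level BPE encodes the space before a word into the token itself
# (the "Ġ" prefix), so decoding such a token reproduces the leading space.
ids = tok("Nice, France.", add_special_tokens=False)["input_ids"]
print(tok.convert_ids_to_tokens(ids))             # 'France' shows up as 'ĠFrance'
print(tok.convert_tokens_to_string(["ĠFrance"]))  # prints ' France', with the space
```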
Many thanks!