Hello everyone,
I am following the great notebook about Token Classification with BERT made by Hugging Face's @nielsr
It works perfectly fine. The WordPiece tokenizer handles the ## subwords and the -100 labels correctly, and the local inference at the end correctly matches sentences with their IOB-label predictions.
My only change is in the variable names. : )
label2id = {k: v for v, k in enumerate(data.Tag.unique())}
id2label = {v: k for v, k in enumerate(data.Tag.unique())}
label2id
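For reference, here is a tiny self-contained sketch of what these two mappings produce, using a hypothetical tag list in place of my real `data.Tag.unique()`:

```python
# Hypothetical tags standing in for data.Tag.unique()
tags = ["O", "B-per", "I-per"]

# Same comprehensions as above: enumerate yields (index, label)
label2id = {k: v for v, k in enumerate(tags)}
id2label = {v: k for v, k in enumerate(tags)}

print(label2id)  # {'O': 0, 'B-per': 1, 'I-per': 2}
print(id2label)  # {0: 'O', 1: 'B-per', 2: 'I-per'}
```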
Unfortunately, when I upload the model to the Hub and test it via the Hosted Inference API, the labels are wrong for entities that get split into subwords. The entity is cut up and the subwords are not merged back together…
To take Niels's example:
[CLS] -100
za 3
##hee -100
##r -100
khan 8
was 0
mar 0
Instead, I got the following labels:
Za (Per) hee (Per) r (Per) Khan (Per) was mar
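What I expected is word-level merging of the WordPiece pieces. As a sketch (the `merge_subwords` helper is my own illustration, not from the notebook): continuation pieces start with `##`, so they should be glued onto the previous token, with the first piece of each word deciding the label:

```python
def merge_subwords(tokens, labels):
    """Merge WordPiece continuation tokens (prefixed with ##) back into
    whole words, keeping the label of each word's first piece."""
    words, word_labels = [], []
    for tok, lab in zip(tokens, labels):
        if tok.startswith("##") and words:
            words[-1] += tok[2:]       # continuation piece: extend last word
        else:
            words.append(tok)          # first piece of a new word
            word_labels.append(lab)    # its label stands for the whole word
    return list(zip(words, word_labels))

print(merge_subwords(["za", "##hee", "##r", "khan", "was"],
                     ["B-per", "I-per", "I-per", "I-per", "O"]))
# [('zaheer', 'B-per'), ('khan', 'I-per'), ('was', 'O')]
```

The Hosted Inference API output above looks like this merging step is missing, so each piece is labelled on its own.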
Do you have any idea what is going on? Thank you in advance!
Akim