Inference API - Sub-words display for Token Classification

Hello everyone,

I am following the great notebook on token classification with BERT made by @nielsr from Hugging Face.

It’s working perfectly fine. The WordPiece tokenizer handles sub-words well: continuation pieces are prefixed with ## and labeled -100. The local inference at the end of the notebook also looks good: the predicted IOB labels line up with the words of each sentence.

My only change is in the variable names. : )

label2id = {k: v for v, k in enumerate(data.Tag.unique())}
id2label = {v: k for v, k in enumerate(data.Tag.unique())}
label2id
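To show what these two mappings contain, here is a minimal sketch. The tag list is hypothetical and only stands in for data.Tag.unique(); the real tags and their order come from the notebook's dataset.

```python
# Hypothetical tag list standing in for data.Tag.unique()
tags = ["O", "B-per", "I-per", "B-geo"]

# Tag string -> integer id, and the inverse mapping
label2id = {label: idx for idx, label in enumerate(tags)}
id2label = {idx: label for idx, label in enumerate(tags)}

print(label2id)
print(id2label)
```

Both dictionaries are built from the same enumeration, so they stay inverses of each other.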

Unfortunately, when I upload the model to the Hub and test it via the Hosted Inference API, entities that are split into sub-words are displayed incorrectly: each sub-word is shown with its own label instead of being merged back into a single word…

To take Niels's example:

[CLS] -100
za 3
##hee -100
##r -100
khan 8
was 0
mar 0
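As I understand it, the listing above follows a simple rule: the first piece of each word keeps the word's label id, and every other piece (plus the special token) gets -100. Here is a minimal sketch of that rule. The function name align_labels is my own, and the sub-word split is written by hand from the listing rather than produced by a real tokenizer.

```python
IGNORE_INDEX = -100  # label value that the loss function skips


def align_labels(word_pieces, word_label_ids):
    """Give each word's label id to its first piece and -100 to the rest."""
    aligned = [("[CLS]", IGNORE_INDEX)]  # special token, never scored
    for pieces, label_id in zip(word_pieces, word_label_ids):
        for position, piece in enumerate(pieces):
            aligned.append((piece, label_id if position == 0 else IGNORE_INDEX))
    return aligned


# Hand-written WordPiece split copied from the listing (no tokenizer is called)
word_pieces = [["za", "##hee", "##r"], ["khan"], ["was"], ["mar"]]
word_label_ids = [3, 8, 0, 0]

for token, label_id in align_labels(word_pieces, word_label_ids):
    print(token, label_id)
```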

I got the following labels…

Za (Per) hee (Per) r (Per) Khan (Per) was mar
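To make clear what I expected instead, here is a minimal sketch of the merging step, written in plain Python. It is only an illustration of the desired output, not how the Inference API works internally. The function name merge_subwords is my own, and the label strings "Per" and "O" are simplified placeholders for the real IOB tags.

```python
def merge_subwords(tokens, labels):
    """Glue '##' continuation pieces onto the previous word.

    The merged word keeps the label of its first piece.
    """
    merged = []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and merged:
            word, first_label = merged[-1]
            merged[-1] = (word + token[2:], first_label)
        else:
            merged.append((token, label))
    return merged


# Tokens from the example above; labels are simplified placeholders
tokens = ["za", "##hee", "##r", "khan", "was", "mar"]
labels = ["Per", "Per", "Per", "Per", "O", "O"]

print(merge_subwords(tokens, labels))
```

With this, "za", "##hee" and "##r" come back as the single word "zaheer" with one label, which is the display I was hoping for.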

Do you have any idea? Thank you in advance guys!

Akim