Hello everyone,
I am following the great notebook about Token Classification with BERT made by Hugging Face's @nielsr
It works perfectly fine. The WordPiece tokenizer handles the ## subwords and the -100 labels correctly, and the local inference at the end correctly matches sentences with their IOB-label predictions.
My only change is in the variable names. : )
label2id = {k: v for v, k in enumerate(data.Tag.unique())}
id2label = {v: k for v, k in enumerate(data.Tag.unique())}
label2id
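For reference, here is a tiny self-contained sketch of what these two mappings produce, using a hypothetical tag list in place of my real `data.Tag.unique()`:

```python
# Hypothetical tags standing in for data.Tag.unique()
tags = ["O", "B-per", "I-per"]

# Same comprehensions as above: enumerate yields (index, label)
label2id = {k: v for v, k in enumerate(tags)}
id2label = {v: k for v, k in enumerate(tags)}

print(label2id)  # {'O': 0, 'B-per': 1, 'I-per': 2}
print(id2label)  # {0: 'O', 1: 'B-per', 2: 'I-per'}
```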
Unfortunately, when I upload the model to the Hub and test it via the Hosted Inference API, the labels are wrong for entities that get split into subwords. The entity is cut up and the subwords are not merged back together…
To take Niels's example:
[CLS] -100
za 3
##hee -100
##r -100
khan 8
was 0
mar 0
Instead, I got the following labels:
Za (Per) hee (Per) r (Per) Khan (Per) was mar
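What I expected is word-level merging of the WordPiece pieces. As a sketch (the `merge_subwords` helper is my own illustration, not from the notebook): continuation pieces start with `##`, so they should be glued onto the previous token, with the first piece of each word deciding the label:

```python
def merge_subwords(tokens, labels):
    """Merge WordPiece continuation tokens (prefixed with ##) back into
    whole words, keeping the label of each word's first piece."""
    words, word_labels = [], []
    for tok, lab in zip(tokens, labels):
        if tok.startswith("##") and words:
            words[-1] += tok[2:]       # continuation piece: extend last word
        else:
            words.append(tok)          # first piece of a new word
            word_labels.append(lab)    # its label stands for the whole word
    return list(zip(words, word_labels))

print(merge_subwords(["za", "##hee", "##r", "khan", "was"],
                     ["B-per", "I-per", "I-per", "I-per", "O"]))
# [('zaheer', 'B-per'), ('khan', 'I-per'), ('was', 'O')]
```

The Hosted Inference API output above looks like this merging step is missing, so each piece is labelled on its own.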
Do you have any idea what is going on? Thank you in advance!
Akim