Hi all,
I am building a BertForTokenClassification model, but I am having trouble figuring out how to format my dataset. I have already labeled my dataset at the word level with span labels. For example:
sequence = “Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge.”
In my dataset, I would have this labeled by hand as:
[(Hugging, B-org), (Face, I-org), (Inc., L-org), (is, O), (a, O), (company, O), … (Bridge, L-org)]
However, when I pass this through my BertTokenizer, I get the following tokens:
[[CLS], Hu, ##gging, Face, Inc., ., is, a, company, …, Bridge, [SEP]]
My question is: how do I handle the label mismatch between the subword tokens Hu, ##gging and the original word Hugging? I have Hugging labeled as B-org, and if I zip these tokens with my labels, every label after the split word is shifted by one position:
[(Hu, B-org), (##gging, I-org), (Face, L-org), (Inc., O), (is, O), (a, O), (company, O), … (Bridge, OUT_OF_LABELS)]
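To make the mismatch concrete, here is a minimal sketch of the alignment I think I need, though I am not sure it is the right approach. It is pure Python with no tokenizer involved: the `tokens` and `word_ids` lists are hand-written assumptions standing in for real tokenizer output (as far as I understand, a fast tokenizer called with `is_split_into_words=True` exposes a similar token-to-word mapping through `word_ids()`). The helper name `align_labels` and the choice to give only the first subword of each word a label are my own, hypothetical choices.

```python
# Word-level labels as annotated by hand (first few words only).
words = ["Hugging", "Face", "Inc.", "is", "a", "company"]
labels = ["B-org", "I-org", "L-org", "O", "O", "O"]

# Hand-written stand-in for tokenizer output (an assumption, not real output).
tokens = ["[CLS]", "Hu", "##gging", "Face", "Inc", ".", "is", "a", "company", "[SEP]"]
# For each token: index of the word it came from, or None for special tokens.
word_ids = [None, 0, 0, 1, 2, 2, 3, 4, 5, None]

# -100 is the default ignore_index of PyTorch's CrossEntropyLoss.
# In a real pipeline the string labels would be mapped to integer ids first.
IGNORE = -100


def align_labels(word_ids, labels, label_all_subtokens=False):
    """Expand word-level labels to token level using the token-to-word mapping."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            # Special tokens such as [CLS] and [SEP] carry no label.
            aligned.append(IGNORE)
        elif wid != previous:
            # First subword of a word takes the word's label.
            aligned.append(labels[wid])
        else:
            # Later subwords are either ignored or repeat the word's label.
            aligned.append(labels[wid] if label_all_subtokens else IGNORE)
        previous = wid
    return aligned


# The broken pairing from above: labels drift as soon as a word is split.
naive = list(zip(tokens[1:], labels))

# The pairing I am after: one entry per token, same length as `tokens`.
aligned = align_labels(word_ids, labels)
for token, label in zip(tokens, aligned):
    print(f"{token:10} {label}")
```

With `label_all_subtokens=True` the continuation pieces repeat their word's label instead of being ignored, which would tag both Hu and ##gging as B-org; I do not know which of the two conventions is preferable.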
Has anybody been able to handle this problem before?