I am about to fine-tune a BERT model on the NER task using a legal dataset with custom entities, and would like to know how fine-tuning on the CoNLL 2003 dataset was handled at the time to create a pretrained BertForTokenClassification model, because I’m facing similar issues. The NER dataset here contains one token (or rather, word) per line. However, the HuggingFace BERT tokenizer (e.g. “bert-base-cased” or any other) will not produce a one-to-one match with this dataset. To give an example, the word “precautionary” (which in the CoNLL 2003 dataset would appear on a single line) is split by the HuggingFace tokenizer into ['pre', '##ca', '##ution', '##ary']. I assume the opposite might also occur, although perhaps much more rarely (i.e. that a token split across two lines in the CoNLL 2003 dataset would be tokenized by HuggingFace as a single token).
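To make the mismatch concrete, here is a minimal reproduction of what I am seeing (the surrounding words are just illustrative, and I am using the fast tokenizer so I can inspect the word-to-piece mapping):

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

# One word per "line", as in the CoNLL 2003 format (surrounding words are made up).
words = ["The", "precautionary", "principle", "applies"]

encoding = tokenizer(words, is_split_into_words=True)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
# e.g. ['[CLS]', 'The', 'pre', '##ca', '##ution', '##ary', 'principle', 'applies', '[SEP]']

# word_ids() maps each wordpiece back to the index of the original word,
# with None for the special tokens:
print(encoding.word_ids())
# e.g. [None, 0, 1, 1, 1, 1, 2, 3, None]
```

So a sentence with four word-level labels ends up with nine wordpiece positions.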
Therefore, I was wondering what transformation was done to convert the CoNLL 2003 dataset (in the format I linked above) into a set of token-level labels aligned with the BERT tokenizer’s output, suitable for building a PyTorch DataLoader.
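For reference, this is roughly the alignment scheme I have been considering. It is only a sketch: the label ids are made up, and the choice to give the word’s label to the first wordpiece and mask the continuation pieces with -100 (so the loss ignores them) is my own assumption, not something I know was used for the original model:

```python
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")

# Hypothetical word-level example in CoNLL style (label ids are illustrative).
words = ["John", "filed", "a", "precautionary", "motion"]
word_labels = [1, 0, 0, 0, 0]  # e.g. 1 = B-PER, 0 = O

encoding = tokenizer(words, is_split_into_words=True, truncation=True)

aligned_labels = []
previous_word_id = None
for word_id in encoding.word_ids():
    if word_id is None:
        # [CLS]/[SEP]: -100 so CrossEntropyLoss ignores these positions.
        aligned_labels.append(-100)
    elif word_id != previous_word_id:
        # First wordpiece of a word carries the word's label.
        aligned_labels.append(word_labels[word_id])
    else:
        # Continuation pieces ('##ca', '##ution', ...) are masked out.
        aligned_labels.append(-100)
    previous_word_id = word_id
```

In particular, I don’t know whether the original CoNLL 2003 fine-tuning labeled only the first wordpiece of each word or propagated the label to every piece, which is exactly what I would like to confirm.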