How to deal with differences between CoNLL 2003 dataset tokenisation and the BERT tokeniser when fine-tuning an NER model?

Hello,

I am about to fine-tune a BERT model on the NER task using a legal dataset with custom entities, and would like to know how the fine-tuning on the CoNLL 2003 dataset was handled at the time in order to create a pretrained BertForTokenClassification model, because I’m facing similar issues. The NER dataset here contains one token (or rather word) per line. However, the HuggingFace BERT tokenizer (e.g. “bert-base-cased” or any other) will not produce a one-to-one match with this dataset. To give an example, the word “precautionary” (which in the CoNLL 2003 dataset would appear on a single line) is split by the HuggingFace tokenizer into ['pre', '##ca', '##ution', '##ary'], and I assume the opposite can happen as well, although probably much more rarely (i.e. tokens that were split across two lines in the CoNLL 2003 dataset could be tokenized by HuggingFace as a single token).
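For reference, the split is easy to reproduce with a couple of lines (the exact wordpieces may of course vary between checkpoints):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
# a single word (one line in the CoNLL file) becomes several wordpieces
print(tokenizer.tokenize("precautionary"))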

Therefore, I was wondering what transformation was done to convert the CoNLL 2003 dataset (in the format I linked above) into a set of token-level labels aligned with the BERT tokenizer, suitable for building a PyTorch DataLoader.


What is typically done is the following: you tokenize each word of your annotated dataset, check how many wordpiece tokens it was split into, and then either label only the first wordpiece of each word, or label all of them.

Small example:

Suppose you have the sentence “hello my name is Niels”, and the CoNLL dataset has this labeled as:

hello O
my O
name O
is O
niels B-PER

Then what we do is the following:

  • option 1: label all tokens of a word (i.e. propagate the word-level label to all of its tokens)
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

words = ["hello", "my", "name", "is", "niels"]
word_labels = ["O", "O", "O", "O", "B-PER"]
# convert word labels to integer labels
label2id = {"O": 0, "B-PER": 1}
word_labels = [label2id[label] for label in word_labels]

tokens = []
labels = []
for word, label in zip(words, word_labels):
    # tokenize the word into wordpieces
    word_tokens = tokenizer.tokenize(word)
    # propagate the word-level label to all wordpieces of the word
    tokens.extend(word_tokens)
    labels.extend([label] * len(word_tokens))
  • option 2: label only the first token of a word, and set the labels of all remaining tokens to -100
words = ["hello", "my", "name", "is", "niels"]
word_labels = ["O", "O", "O", "O", "B-PER"]
# convert word labels to integer labels
label2id = {"O": 0, "B-PER": 1}
word_labels = [label2id[label] for label in word_labels]

tokens = []
labels = []
for word, label in zip(words, word_labels):
    # tokenize the word into wordpieces
    word_tokens = tokenizer.tokenize(word)
    tokens.extend(word_tokens)
    # label only the first wordpiece; ignore the rest with -100
    labels.extend([label] + [-100] * (len(word_tokens) - 1))

The reason we set the labels of the remaining tokens to -100 is that this is the default ignore_index of PyTorch’s CrossEntropyLoss. This means that those labels will not be taken into account by the loss function, and hence no gradients will be computed for them.

Which of the two options you choose is mainly a design choice; in practice, both perform well. I’ve made two notebooks (one for each option) in my repo here.
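In case it helps, below is a rough sketch (not taken from the notebooks) of how the tokens and labels from option 2 above could then be turned into padded tensors for a PyTorch DataLoader. It continues from the snippet above; the maximum length, the single-sentence batch, and the use of -100 for special and padding tokens are assumptions of this sketch:

import torch
from torch.utils.data import DataLoader, TensorDataset

max_len = 32  # assumed fixed maximum sequence length

# add special tokens and truncate to max_len
tokens = ["[CLS]"] + tokens[: max_len - 2] + ["[SEP]"]
labels = [-100] + labels[: max_len - 2] + [-100]  # special tokens are ignored by the loss

input_ids = tokenizer.convert_tokens_to_ids(tokens)
attention_mask = [1] * len(input_ids)

# pad everything up to max_len; padding positions are also ignored by the loss
padding = max_len - len(input_ids)
input_ids += [tokenizer.pad_token_id] * padding
attention_mask += [0] * padding
labels += [-100] * padding

dataset = TensorDataset(
    torch.tensor([input_ids]),
    torch.tensor([attention_mask]),
    torch.tensor([labels]),
)
dataloader = DataLoader(dataset, batch_size=1)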


Great, thank you very much for the detailed explanation!

Hi @nielsr
I have now gone through your notebooks and have a further (slightly different) question on your choice of evaluation strategy. I thought I’d post it here to continue the conversation, even though it’s not exactly related to the title topic.
I have seen that you calculated the performance using accuracy_score from sklearn.metrics.
However, the way you apply it in your code, i.e. accuracy_score(labels.cpu().numpy(), predictions.cpu().numpy()), looks like a strict token-by-token comparison, so if there are, say, 13 different labels and the model predicts “B-ORG” instead of “I-ORG”, or “O” instead of “I-ORG”, both would be considered equally “wrong”.
I am reading here that there are many other ways to evaluate performance on NER datasets, which on the surface at least seem quite sensible, as they look at the entities rather than at each individual token. Out of curiosity, is there a specific reason why you chose token-level accuracy as opposed to, for example, entity-level F1 scores or other metrics?
Thanks a lot!

Hi,

I just checked my notebooks, and I actually do evaluate at the named-entity level rather than at the token level. I do this using the seqeval library.

It’s actually the classification_report that shows the performance for the different NER categories.
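For reference, this is roughly what the seqeval usage looks like (the label sequences below are made up for illustration):

from seqeval.metrics import classification_report, f1_score

# seqeval expects one list of string labels per sentence,
# aligned between gold labels and predictions
y_true = [["O", "O", "B-PER", "I-PER", "O"]]
y_pred = [["O", "O", "B-PER", "O", "O"]]

print(classification_report(y_true, y_pred))
print(f1_score(y_true, y_pred))  # entity-level F1 rather than token accuracy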

Sorry I missed that detail. Many thanks for pointing it out!

Hi @nielsr Thank you so much for the effort on the notebooks! I was following your tutorials. Just a follow-up question: in my customized dataset I also get many broken subwords, as @AndreaSottana mentioned. I was wondering, what if I design a new tokenizer that handles my dataset and avoids the subword problem caused by WordPiece, and then retrain the BERT model with my customized tokenizer on my dataset? Would this kind of solve the problem for NER labeling? The intuition is that by doing this I can reduce the number of subwords and make BERT learn my domain context better. I am guessing there might be some mismatch between the WordPiece tokenizer (with which BERT was originally trained) and my customized tokenizer working on my customized dataset. Any ideas or comments?
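Something like the sketch below is what I have in mind, using the HuggingFace tokenizers library (the corpus file and vocabulary size are just placeholders):

from tokenizers import BertWordPieceTokenizer

# train a domain-specific WordPiece vocabulary from scratch on my own corpus
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(files=["corpus.txt"], vocab_size=30522, min_frequency=2)
tokenizer.save_model("my-domain-tokenizer")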