Converting Word-level labels to WordPiece-level for Token Classification

I’ve been working on NER with BERT and have also encountered the problem of aligning NER tags with sub-word pieces. I’ve found two approaches:

  1. HuggingFace provides a sample implementation (huggingface.co) in which, if a word is broken into sub-word pieces, the NER tag is assigned only to the first sub-word piece and the remaining pieces are ignored. For example, if Washington is a LOCATION, then one possible sub-word tokenization and label assignment would be:
Before tokenization:
  Washington      LOCATION

After tokenization:
  Wash    LOCATION
  ##ing   ignore
  ##ton   ignore
  2. The approach from the previously mentioned tutorial (depends-on-the-definition.com) instead places the same NER label on all sub-word pieces. Here is how the same example would be treated:
Before tokenization:
  Washington      LOCATION

After tokenization:
  Wash    LOCATION
  ##ing   LOCATION
  ##ton   LOCATION

The approach in 1 is more efficient because your text is encoded one sentence at a time rather than one word at a time. See the magic in the encode_tags() function of the HuggingFace tutorial.
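
For reference, here is a minimal sketch of that first-sub-word labelling scheme. It uses the word_ids() mapping of a fast tokenizer rather than the offset mappings that encode_tags() works with, and the model name and the align_labels() helper are assumptions of mine, but the effect is the same: the first sub-word piece of each word keeps the label, and every other position gets -100, the index that PyTorch's cross-entropy loss ignores.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def align_labels(words, word_label_ids):
        # words: the sentence, pre-split into words
        # word_label_ids: one integer label id per word
        encoding = tokenizer(words, is_split_into_words=True)
        aligned = []
        previous_word_id = None
        for word_id in encoding.word_ids():
            if word_id is None or word_id == previous_word_id:
                # Special tokens ([CLS], [SEP]) and trailing sub-word
                # pieces are skipped by the loss
                aligned.append(-100)
            else:
                # The first sub-word piece of a word keeps the word's label
                aligned.append(word_label_ids[word_id])
            previous_word_id = word_id
        return encoding, aligned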

The approach in 2 requires that each word be tokenized separately and the label list extended once per resulting sub-word, as shown here:

    from transformers import AutoTokenizer

    # Any WordPiece tokenizer works here; bert-base-cased is just an example
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    sentence = ["Washington"]      # the sentence, pre-split into words
    text_labels = ["LOCATION"]     # one label per word

    tokenized_sentence = []
    labels = []

    for word, label in zip(sentence, text_labels):

        # Tokenize the word and count the number of sub-words it is split into
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)

        # Add the tokenized word to the final tokenized word list
        tokenized_sentence.extend(tokenized_word)

        # Add the same label to the new list of labels `n_subwords` times
        labels.extend([label] * n_subwords)
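
Running this on the Washington example gives one label per sub-word piece; the exact split depends on the tokenizer's vocabulary, so the Wash / ##ing / ##ton split below is only illustrative:

    print(list(zip(tokenized_sentence, labels)))
    # e.g. [('Wash', 'LOCATION'), ('##ing', 'LOCATION'), ('##ton', 'LOCATION')]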