Converting Word-level labels to WordPiece-level for Token Classification

I’ve been working on NER with BERT and have also encountered the problem of aligning NER tags with sub-word pieces. I’ve found two approaches:

  1. HuggingFace provides a sample implementation (huggingface.co) in which, if a word is broken into sub-word pieces, the NER tag is assigned only to the first sub-word piece and the remaining pieces are ignored. For example, if Washington is a LOCATION, then one possible sub-word tokenization and label assignment would be:
Before tokenization:
  Washington      LOCATION

After tokenization:
  Wash    LOCATION
  ##ing   ignore
  ##ton   ignore
  2. The approach from the previously mentioned tutorial (depends-on-the-definition.com) instead places the same NER label on all sub-word pieces. Here is how the same example would be treated:
Before tokenization:
  Washington      LOCATION

After tokenization:
  Wash    LOCATION
  ##ing   LOCATION
  ##ton   LOCATION

The approach in 1 is more efficient because your text is encoded one sentence at a time rather than one word at a time. See the magic in the encode_tags() function of the HuggingFace tutorial.
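
For reference, here is a minimal sketch of that first-sub-word labelling scheme. It uses the word_ids() mapping of a fast tokenizer rather than the offset mappings that encode_tags() works with, and the model name and the align_labels() helper are assumptions of mine, but the effect is the same: the first sub-word piece of each word keeps the label, and every other position gets -100, the index that PyTorch's cross-entropy loss ignores.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def align_labels(words, word_label_ids):
        # words: the sentence, pre-split into words
        # word_label_ids: one integer label id per word
        encoding = tokenizer(words, is_split_into_words=True)
        aligned = []
        previous_word_id = None
        for word_id in encoding.word_ids():
            if word_id is None or word_id == previous_word_id:
                # Special tokens ([CLS], [SEP]) and trailing sub-word
                # pieces are skipped by the loss
                aligned.append(-100)
            else:
                # The first sub-word piece of a word keeps the word's label
                aligned.append(word_label_ids[word_id])
            previous_word_id = word_id
        return encoding, aligned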

The approach in 2 requires that each word be tokenized separately and the label list extended once per resulting sub-word, as shown here:

    from transformers import AutoTokenizer

    # Any WordPiece tokenizer works here; bert-base-cased is just an example
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    sentence = ["Washington"]      # the sentence, pre-split into words
    text_labels = ["LOCATION"]     # one label per word

    tokenized_sentence = []
    labels = []

    for word, label in zip(sentence, text_labels):

        # Tokenize the word and count the number of sub-words it is split into
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)

        # Add the tokenized word to the final tokenized word list
        tokenized_sentence.extend(tokenized_word)

        # Add the same label to the new list of labels `n_subwords` times
        labels.extend([label] * n_subwords)
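
Running this on the Washington example gives one label per sub-word piece; the exact split depends on the tokenizer's vocabulary, so the Wash / ##ing / ##ton split below is only illustrative:

    print(list(zip(tokenized_sentence, labels)))
    # e.g. [('Wash', 'LOCATION'), ('##ing', 'LOCATION'), ('##ton', 'LOCATION')]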