Converting Word-level labels to WordPiece-level for Token Classification

Hi all,

I am building a BertForTokenClassification model but I am having trouble figuring out how to format my dataset. I have already labeled my dataset with span labeling. So for example:

sequence = “Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very close to the Manhattan Bridge.”

In my dataset, I would have this labeled by hand as:
[(Hugging, B-org), (Face, I-org), (Inc., L-org), (is, O), (a, O), (company, O), … (Bridge, L-org)]

However, when I pass this through my BertTokenizer, I get the following tokens:
[[CLS], Hu, ##gging, Face, Inc., ., is, a, company, …, Bridge, [SEP]]

My question is, how do I handle the Hu, ##gging <-> Hugging label mismatch issue? I have Hugging labeled as B-org, and if I zip these tokens with my labels my labels will be offset by one:
[(Hu, B-org), (##gging, I-org), (Face, L-org), (Inc., O), (is, O), (a, O), (company, O), … (Bridge, OUT_OF_LABELS)]

Has anybody been able to handle this problem before?


Hi @altozachmo,
you can “extend” the labels list, adding as many labels as the number of token splits.
So, for example:

sequence = “Hugging Face"
labels = [(Hugging, B-org),(Face, I-org)]
tokenized_sentence = [[CLS], Hu, ##gging, Face, [SEP]]
tokenized_labels = [(Hu, B-org),(##gging, B-org),(Face, I-org)]

(the tokenized_labels should also include labels for the [CLS] and [SEP] tokens, but I omitted them here)

To do this, you can check this tutorial and look for the “tokenize_and_preserve_labels” function.
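Something along these lines works as a rough sketch of that idea (assuming `tokenizer` is a BertTokenizer loaded from bert-base-cased, as in your example; the variable names are just illustrative):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

    word, label = "Hugging", "B-org"
    pieces = tokenizer.tokenize(word)      # e.g. ['Hu', '##gging'] with this vocab
    piece_labels = [label] * len(pieces)   # ['B-org', 'B-org']

Doing this word by word and extending two running lists gives you the tokenized sentence and its aligned labels.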


Thank you! That tutorial is very helpful!

Hi all,
How can I do the opposite, going from WordPiece-level back to word-level?

I think you could write a small Python routine that merges each token with any adjacent pieces that start with “##”.
So, for example:

tokenized_sentence = ["[CLS]", "Hu", "##gging", "Face", "[SEP]"]
merged = []
for token in tokenized_sentence:
    if token.startswith("##") and merged:
        merged[-1] += token[2:]  # drop "##" and glue onto the previous token
    else:
        merged.append(token)
# merged == ["[CLS]", "Hugging", "Face", "[SEP]"]

Unfortunately, I do not have a tutorial about this, but you could start from this snippet.

@Sergio Thank you for your reply.
The solution is clear to implement if you already know how the tokenizer works. In my case I am using the CamemBERT model, and I couldn't find anything that explains how its tokenization works. I just observed how it handles a few sentences and implemented a solution based on that, but I'm not sure my implementation is free of bugs.
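For what it's worth, CamemBERT uses a SentencePiece-based tokenizer, so instead of “##” on continuation pieces it puts a leading “▁” on pieces that start a new word. A rough sketch of inspecting and merging its output (assuming the camembert-base checkpoint):

    from transformers import CamembertTokenizer

    tokenizer = CamembertTokenizer.from_pretrained("camembert-base")
    pieces = tokenizer.tokenize("J'aime le camembert")
    print(pieces)  # pieces that start a word carry a leading "▁"

    words = []
    for piece in pieces:
        if piece.startswith("▁") or not words:
            words.append(piece.lstrip("▁"))  # new word: drop the "▁" marker
        else:
            words[-1] += piece               # continuation piece: append to the current word
    print(words)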

I’ve been working on NER with BERT and have also encountered the problem of aligning NER tags with sub-word pieces. I’ve found two approaches:

  1. HuggingFace provides a sample implementation (huggingface.co) where, if a token is broken into sub-word pieces, the NER tag is attached only to the first sub-word piece and the remaining pieces are ignored. For example, if Washington is a LOCATION, then a potential sub-word tokenization and assignment of labels would be:
Before tokenization:
  Washington      LOCATION

After tokenization:
  Wash    LOCATION
  ##ing   ignore
  ##ton   ignore

  2. The approach from the previously-mentioned tutorial (depends-on-the-definition.com) instead places the same NER label on all sub-pieces. Here is how the above example would be treated:
Before tokenization:
  Washington      LOCATION

After tokenization:
  Wash    LOCATION
  ##ing   LOCATION
  ##ton   LOCATION

The approach in 1 is more efficient because your text is encoded one whole sentence at a time. See the magic in the encode_tags() function of that tutorial.
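If it helps, here is a rough sketch of that first approach using a fast tokenizer's word_ids() mapping; the label2id mapping, the example words, and the use of -100 as the ignore value are my own illustration rather than the tutorial's exact code:

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")  # fast tokenizer by default
    label2id = {"O": 0, "LOCATION": 1}

    words = ["Washington", "is", "nice"]
    word_labels = ["LOCATION", "O", "O"]

    encoding = tokenizer(words, is_split_into_words=True)
    aligned_labels = []
    previous_word = None
    for word_idx in encoding.word_ids():
        if word_idx is None or word_idx == previous_word:
            aligned_labels.append(-100)  # special tokens and trailing sub-word pieces are ignored
        else:
            aligned_labels.append(label2id[word_labels[word_idx]])  # first piece keeps the label
        previous_word = word_idx

Here -100 is the index that PyTorch's cross-entropy loss skips, so those positions never contribute to training.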

The approach in 2 requires that words are tokenized and the labels are extended for every token in your text, as shown here:

    tokenized_sentence = []
    labels = []

    for word, label in zip(sentence, text_labels):
        # Tokenize the word and count how many subwords it is broken into
        tokenized_word = tokenizer.tokenize(word)
        n_subwords = len(tokenized_word)

        # Add the tokenized word to the final tokenized word list
        tokenized_sentence.extend(tokenized_word)

        # Add the same label to the new list of labels `n_subwords` times
        labels.extend([label] * n_subwords)


The PreTrainedTokenizer class has a method called decode() that converts token_ids (potentially corresponding to subword pieces) back into a readable string. There's a tight loop iterating over each token, but I can't quite follow the logic myself.
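For example, the round trip looks roughly like this (the example string is mine):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

    ids = tokenizer.encode("Hugging Face", add_special_tokens=False)
    tokens = tokenizer.convert_ids_to_tokens(ids)       # e.g. ['Hu', '##gging', 'Face']
    print(tokenizer.convert_tokens_to_string(tokens))   # Hugging Face
    print(tokenizer.decode(ids))                        # Hugging Face

convert_tokens_to_string() is the piece that stitches the “##” fragments back together.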


Hi,
We put out two posts about aligning raw spans to tokens.
This one has the implementation details, with considerations for padding / batching and generally dealing with longer texts.

Also, there’s a PR here that provides a visualization utility so that you can visually check your alignments in a notebook.

Hope that helps

Hello facehugger2020,

Thanks so much for your thorough explanation, but I'd like to ask one follow-up question.

As another user posted in the AllenNLP GitHub issues, the huggingface transformers NER example uses pad_token_label_id to handle the problem of mismatched subtokens: the first subtoken keeps the original true label, while the following subtokens are labeled with pad_token_label_id. The model then ignores all of these padding labels during training.
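(As far as I can tell, in the transformers NER example scripts pad_token_label_id is set to the loss function's ignore index; the sketch below, with made-up tensors, just illustrates why those positions drop out of the loss:)

    import torch
    from torch.nn import CrossEntropyLoss

    pad_token_label_id = CrossEntropyLoss().ignore_index  # -100 by default

    logits = torch.randn(4, 3)  # 4 sub-word positions, 3 label classes
    labels = torch.tensor([1, pad_token_label_id, pad_token_label_id, 0])

    # Positions labeled with the ignore index contribute nothing to the loss
    loss = CrossEntropyLoss()(logits, labels)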

I am not sure whether this feature is on by default. If it is, does that mean we no longer need to apply encode_tags and the follow-up re-labeling process?

Please find the referenced issue here.

Thanks again for any clarification.