Bug with tokenizer's offset mapping for NER problems?

I’m working on NER and am following the Token Classification with W-NUT Emerging Entities tutorial. I’m relying on the code in that tutorial to identify which tokens come from the original words and which tokens have been added by the tokenizer, such as sub-word pieces and special tokens like [CLS].

The tutorial says the following:

Now we arrive at a common obstacle with using pre-trained models for token-level classification: many of the tokens in the W-NUT corpus are not in DistilBert’s vocabulary. Bert and many models like it use a method called WordPiece Tokenization, meaning that single words are split into multiple tokens such that each token is likely to be in the vocabulary.

Let’s write a function to do this. This is where we will use the offset_mapping from the tokenizer as mentioned above. For each sub-token returned by the tokenizer, the offset mapping gives us a tuple indicating the sub-token’s start position and end position relative to the original token it was split from. That means that if the first position in the tuple is anything other than 0, we will set its corresponding label to -100. While we’re at it, we can also set labels to -100 if the second position of the offset mapping is 0, since this means it must be a special token like [PAD] or [CLS].
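
As I read it, that rule boils down to roughly the following sketch (my own condensation, not the tutorial’s exact code; tags here is a list of per-word label ids for each example):

import numpy as np

def encode_tags(tags, encodings):
    # encodings: tokenizer output produced with return_offsets_mapping=True
    encoded_labels = []
    for doc_labels, doc_offset in zip(tags, encodings.offset_mapping):
        # start with -100 everywhere, then fill in the real labels
        doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
        arr_offset = np.array(doc_offset)
        # keep a label only where the offset starts at 0 and the token is not special
        doc_enc_labels[(arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())
    return encoded_labels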

I get different results for the offset mapping from the tokenizer depending on whether the input text is a complete sentence or a list of tokens.


from transformers import AutoTokenizer

# a fast tokenizer is needed for return_offsets_mapping; the tutorial uses DistilBERT
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-cased')

batch_sentences = ['The quick brown fox jumped over the lazy dog.',
                   'That dog is really lazy.']

encoded_dict = tokenizer(text=batch_sentences,
                         add_special_tokens=True,
                         max_length=64,
                         padding=True,
                         truncation=True,
                         return_token_type_ids=True,
                         return_attention_mask=True,
                         return_offsets_mapping=True,
                         return_tensors='pt'
                         )

print(encoded_dict.offset_mapping)

That prints:

tensor([[[ 0,  0],
         [ 0,  3],
         [ 4,  9],
         [10, 15],
         [16, 19],
         [20, 26],
         [27, 31],
         [32, 35],
         [36, 40],
         [41, 44],
         [44, 45],
         [ 0,  0]],

        [[ 0,  0],
         [ 0,  4],
         [ 5,  8],
         [ 9, 11],
         [12, 18],
         [19, 23],
         [23, 24],
         [ 0,  0],
         [ 0,  0],
         [ 0,  0],
         [ 0,  0],
         [ 0,  0]]])

On the other hand, if the sentences are already split into words, I get different results:

batch_sentences = [['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.'],
                   ['That', 'dog', 'is', 'really', 'lazy.']]

encoded_dict = tokenizer(text=batch_sentences,
                         is_split_into_words=True, # <--- different
                         add_special_tokens=True,
                         max_length=64,
                         padding=True,
                         truncation=True,
                         return_token_type_ids=True,
                         return_attention_mask=True,
                         return_offsets_mapping=True,
                         return_tensors='pt'
                         )

print(encoded_dict.offset_mapping)

That prints:

tensor([[[0, 0],
         [0, 3],
         [0, 5],
         [0, 5],
         [0, 3],
         [0, 6],
         [0, 4],
         [0, 3],
         [0, 4],
         [0, 3],
         [3, 4],
         [0, 0]],

        [[0, 0],
         [0, 4],
         [0, 3],
         [0, 2],
         [0, 6],
         [0, 4],
         [4, 5],
         [0, 0],
         [0, 0],
         [0, 0],
         [0, 0],
         [0, 0]]])

Here’s a Colab notebook with a full working example.

If this is a bug, I’ll open an issue on GitHub.

I’m not sure what you think the bug is: the offset_mapping maps tokens back to the original texts. If you provide the original texts in different formats, you are going to get different results. In your second set of results, each time the first offset comes back to 0 it corresponds to the start of one of your words, and you get (0, 0) for the special tokens, which is what the tutorial you mention detects.

For non-split texts, you get the spans in the original text (though I’m not sure how you get your labels in that case?)
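
For instance, with your first (non-split) example, slicing the raw string with those offsets gives back the token text:

text = 'The quick brown fox jumped over the lazy dog.'
print(text[20:26])               # -> jumped
print(text[41:44], text[44:45])  # -> dog .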

Note that if you only want to detect the special tokens, you can use the special_tokens_mask the tokenizer can return if you add the flag return_special_tokens_mask=True. Also, for another approach using the word_ids method that fast tokenizers provide, you should check out the token classification example script.
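
For example (a quick sketch with your pre-split batch, not tested against your exact setup):

encoded = tokenizer(batch_sentences,
                    is_split_into_words=True,
                    return_special_tokens_mask=True)

# 1 marks added tokens such as [CLS], [SEP] and [PAD]; 0 marks regular tokens
print(encoded['special_tokens_mask'])

# word_ids maps each token to the index of the word it came from,
# with None for special tokens (fast tokenizers only)
print(encoded.word_ids(batch_index=0))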

Thank you for the explanation.

I see the source of my misunderstanding. As I mentioned, the tutorial has this passage:

This is where we will use the offset_mapping from the tokenizer as mentioned above. For each sub-token returned by the tokenizer, the offset mapping gives us a tuple indicating the sub-token’s start position and end position relative to the original token it was split from.

What I didn’t fully understand is that the sentences in that NER example were already pre-split into tokens. I thought that if you pass non-split sentences into the tokenizer, it would return an offset_mapping with the same values as for the pre-split sentences, since the tokenizer still does the tokenization.

Note that if you only want to detect the special tokens, you can use the special_tokens_mask the tokenizer can return if you add the flag return_special_tokens_mask=True.

I also want to mask out the sub-token pieces that were split off of longer words. I tried out the special_tokens_mask, and it only marks added tokens like [CLS], not the sub-token pieces.

If you want to mask both sub-tokens and special tokens, look at the script I mentioned in my earlier reply, since it does just that with the word_ids method.
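
In short, it does something like this (a simplified sketch; word_labels stands for the per-word label ids of one example):

def align_labels_with_tokens(word_labels, word_ids):
    # -100 for special tokens (word_id is None) and for sub-word continuations,
    # the word's label for the first sub-token of each word
    labels = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None or word_id == previous_word_id:
            labels.append(-100)
        else:
            labels.append(word_labels[word_id])
        previous_word_id = word_id
    return labels

encoded = tokenizer(batch_sentences, is_split_into_words=True)
labels = align_labels_with_tokens(word_labels, encoded.word_ids(batch_index=0))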