I’m working on NER and am following the Token Classification with W-NUT Emerging Entities tutorial. I’m relying on the code in that tutorial to identify which tokens are valid and which tokens have been added by the tokenizer, such as subword tokens and special tokens like [CLS].
The tutorial says the following:
Now we arrive at a common obstacle with using pre-trained models for token-level classification: many of the tokens in the W-NUT corpus are not in DistilBert’s vocabulary. Bert and many models like it use a method called WordPiece Tokenization, meaning that single words are split into multiple tokens such that each token is likely to be in the vocabulary.
Let’s write a function to do this. This is where we will use the offset_mapping from the tokenizer as mentioned above. For each sub-token returned by the tokenizer, the offset mapping gives us a tuple indicating the sub-token’s start position and end position relative to the original token it was split from. That means that if the first position in the tuple is anything other than 0, we will set its corresponding label to -100. While we’re at it, we can also set labels to -100 if the second position of the offset mapping is 0, since this means it must be a special token like [PAD] or [CLS].
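For reference, this is the labeling function I adapted from that part of the tutorial. The name encode_labels and the numpy masking are my adaptation, so treat it as a sketch of the rule described above rather than the tutorial’s exact code:

import numpy as np

def encode_labels(labels, encodings):
    # labels: one list of per-word label ids per example
    # encodings: BatchEncoding produced with return_offsets_mapping=True
    encoded_labels = []
    for doc_labels, doc_offsets in zip(labels, encodings.offset_mapping):
        # start with -100 everywhere (ignored by the loss)
        doc_enc_labels = np.ones(len(doc_offsets), dtype=int) * -100
        arr_offsets = np.array(doc_offsets)
        # keep real labels only where the offset starts at 0 and does not end at 0,
        # i.e. the first sub-token of each original word
        doc_enc_labels[(arr_offsets[:, 0] == 0) & (arr_offsets[:, 1] != 0)] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())
    return encoded_labels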
I get different results for the offset mapping from the tokenizer depending on whether the input is a complete sentence or a list of tokens.
from transformers import DistilBertTokenizerFast

# fast tokenizer from the tutorial (return_offsets_mapping requires a fast tokenizer)
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-cased')

batch_sentences = ['The quick brown fox jumped over the lazy dog.',
                   'That dog is really lazy.']
encoded_dict = tokenizer(text=batch_sentences,
                         add_special_tokens=True,
                         max_length=64,
                         padding=True,
                         truncation=True,
                         return_token_type_ids=True,
                         return_attention_mask=True,
                         return_offsets_mapping=True,
                         return_tensors='pt'
                         )
print(encoded_dict.offset_mapping)
That prints:
tensor([[[ 0, 0],
[ 0, 3],
[ 4, 9],
[10, 15],
[16, 19],
[20, 26],
[27, 31],
[32, 35],
[36, 40],
[41, 44],
[44, 45],
[ 0, 0]],
[[ 0, 0],
[ 0, 4],
[ 5, 8],
[ 9, 11],
[12, 18],
[19, 23],
[23, 24],
[ 0, 0],
[ 0, 0],
[ 0, 0],
[ 0, 0],
[ 0, 0]]])
On the other hand, if the sentences are already split into words, I get different results:
batch_sentences = [['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog.'],
                   ['That', 'dog', 'is', 'really', 'lazy.']]
encoded_dict = tokenizer(text=batch_sentences,
                         is_split_into_words=True,  # <--- different
                         add_special_tokens=True,
                         max_length=64,
                         padding=True,
                         truncation=True,
                         return_token_type_ids=True,
                         return_attention_mask=True,
                         return_offsets_mapping=True,
                         return_tensors='pt'
                         )
print(encoded_dict.offset_mapping)
That prints:
tensor([[[0, 0],
[0, 3],
[0, 5],
[0, 5],
[0, 3],
[0, 6],
[0, 4],
[0, 3],
[0, 4],
[0, 3],
[3, 4],
[0, 0]],
[[0, 0],
[0, 4],
[0, 3],
[0, 2],
[0, 6],
[0, 4],
[4, 5],
[0, 0],
[0, 0],
[0, 0],
[0, 0],
[0, 0]]])
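To make the two outputs easier to compare, I’ve been printing each sub-token next to its offset pair with the small helper below (this is just my own inspection code, not part of the tutorial). In the is_split_into_words=True case it looks to me like the offsets are computed relative to each individual word rather than to the whole sentence, which is why almost every start position is 0.

# print each sub-token of the first sentence next to its (start, end) offsets
tokens = tokenizer.convert_ids_to_tokens(encoded_dict.input_ids[0].tolist())
for token, (start, end) in zip(tokens, encoded_dict.offset_mapping[0].tolist()):
    print(f'{token:>10}  ({start}, {end})')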
Here’s a Colab notebook with a full working example.
If this is a bug, I’ll open an issue on GitHub.