Issues with offset_mapping values

Hi guys, I am trying to work with a FairSeq model converted to :hugs: but I have some issues with the tokenizer. I am trying to fine-tune it for POS tagging, so the text is already split into words, and I want to use the offset_mapping to detect the first token of each word. I do it like this:

from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained('path', add_prefix_space=True)
ids = tokenizer([['drieme', 'drieme'], ['drieme']],
    is_split_into_words=True,
    padding=True,
    return_offsets_mapping=True)

The tokenization looks like this:

['<s>', 'Ġd', 'rieme', 'Ġd', 'rieme', '</s>']

But the output from the command looks like this:

{
  'input_ids': [[0, 543, 24209, 543, 24209, 2], [0, 543, 24209, 2, 1, 1]],
  'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0]],
  'offset_mapping': [
    [(0, 0), (0, 1), (1, 6), (1, 1), (1, 6), (0, 0)],
    [(0, 0), (0, 1), (1, 6), (0, 0), (0, 0), (0, 0)]
  ]
}

Notice the offset mapping for the word drieme in the first sentence. The first occurrence has mappings (0, 1) and (1, 6), which looks reasonable, but the second drieme gets (1, 1) and (1, 6). Suddenly there is a 1 at the first position. This 1 appears for every word except the first in any sentence I try to parse. I suspect it has something to do with how the start of the sentence is handled differently from the other words, but I am not sure how to solve this so that I get proper offset mappings. What am I doing wrong?
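For reference, this is roughly how I planned to pick the first token of each word from the offsets (a minimal sketch; the rule "the offset starts at 0" is my own assumption of how it should behave, and first_subtoken_mask is just a name I made up):

# Minimal sketch of what I am trying to do: a token should start a word
# when its offset starts at 0 (and it is not a special token).
def first_subtoken_mask(offset_mapping):
    mask = []
    for start, end in offset_mapping:
        if (start, end) == (0, 0):
            mask.append(False)      # special tokens like <s>, </s>, <pad>
        else:
            mask.append(start == 0)
    return mask

print(first_subtoken_mask(ids['offset_mapping'][0]))
# [False, True, False, False, False, False] -> the second 'drieme' is missed,
# because its first token has offset (1, 1) instead of (0, 1)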

Thanks for reporting, it’s definitely a bug. Could you open an issue on tokenizers with your snippet?

Having a similar issue as well. Is there a proper fix yet?

Don’t know about a fix, but I worked around it by detecting the special character that marks the start of each word. In my case this was Ġ. I used this instead of the offset mapping to detect the start of a word.
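With the snippet from the first post, the workaround looks roughly like this (a minimal sketch; word_start_mask is just an illustrative name, and it assumes a byte-level BPE tokenizer where Ġ marks word-initial tokens):

def word_start_mask(input_ids, tokenizer):
    # Convert ids back to token strings and flag the ones that start a new word.
    tokens = tokenizer.convert_ids_to_tokens(input_ids)
    mask = []
    for token in tokens:
        if token in tokenizer.all_special_tokens:
            mask.append(False)                   # <s>, </s>, <pad>, ...
        else:
            mask.append(token.startswith('Ġ'))   # Ġ marks the start of a word
    return mask

print(word_start_mask(ids['input_ids'][0], tokenizer))
# With the tokenization above: [False, True, False, True, False, False]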

Hello,

Is there any update on this topic? I am encountering the same issue with any tokenizer that has the add_prefix_space=True flag, which seems to be needed, for example, for RoBERTa tokenizers. A few more examples to illustrate the issue, using words = ['hello', 'world', 'foo', 'bar', '32', '31290485']:

  • Using a non-RoBERTa tokenizer doesn’t show the issue and works as expected:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer(words, is_split_into_words=True, return_offsets_mapping=True)
# This returns:
# 'input_ids': [101, 7592, 2088, 29379, 3347, 3590, 21036, 21057, 18139, 2629, 102]
# 'offset_mapping': [(0, 0), (0, 5), (0, 5), (0, 3), (0, 3), (0, 2), (0, 3), (3, 5), (5, 7), (7, 8), (0, 0)]
# Running tokenizer.add_prefix_space would raise an AttributeError

In the above, as expected, special tokens have offsets (0, 0), beginnings of words are (0, n), with n being the word/chunk length, and continuations of words are of the form (n, m), with n > 0 being the start of the chunk within the word and m - n the chunk length.

  • However, RoBERTa gives the following:
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base', add_prefix_space=True)
tokenizer(words, is_split_into_words=True, return_offsets_mapping=True)
# This returns
# 'input_ids': [0, 20760, 232, 48769, 2003, 2107, 32490, 3248, 33467, 2]
# 'offset_mapping': [(0, 0), (0, 5), (1, 5), (1, 3), (1, 3), (1, 2), (1, 3), (3, 5), (5, 8), (0, 0)]
tokenizer.add_prefix_space
# This returns True

The special tokens are still marked with offset (0, 0), the first word is still of the form (0, n), and the continuation chunks of each word are still of the form (n, m), but the beginning of every word after the first is now of the form (1, n), with n still being the word/chunk length. So this looks to be the only exception to the rule that the length of a chunk is m - n, and that’s not good (a small sketch after this list makes the difference concrete). I suppose the leading 1 is there because, as the flag name suggests, a space is added before each word and not taken into account when computing the offset_mapping, but as suggested above that looks like a bug.
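Here is that sketch, with the two offset lists printed above hard-coded (the start == 0 rule is just the naive word-start test described above, nothing official):

# Offsets copied from the two outputs above (special tokens included).
bert_offsets    = [(0, 0), (0, 5), (0, 5), (0, 3), (0, 3), (0, 2), (0, 3), (3, 5), (5, 7), (7, 8), (0, 0)]
roberta_offsets = [(0, 0), (0, 5), (1, 5), (1, 3), (1, 3), (1, 2), (1, 3), (3, 5), (5, 8), (0, 0)]

def naive_word_starts(offsets):
    # A token starts a word if its offset starts at 0 and it is not a special token.
    return [start == 0 and (start, end) != (0, 0) for start, end in offsets]

print(naive_word_starts(bert_offsets))
# [False, True, True, True, True, True, True, False, False, False, False] -> six word starts, one per word
print(naive_word_starts(roberta_offsets))
# [False, True, False, False, False, False, False, False, False, False]   -> only the first word is detected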

For my purpose (running NER) I can probably work around the above by explicitly checking whether the word is the first one in the sentence or not, but the situation gets weirder when the chunks are just one character long (so that, with the above logic, the offset_mapping should then be something like (1, 1), (1, 2)). For example, using words = ['hello', 'world', 'foo', 'bar', '32', '31290485', 'üù'], I am getting:

{'input_ids': [0, 20760, 232, 48769, 2003, 2107, 32490, 3248, 33467, 952, 4394, 3849, 9253, 2],
'offset_mapping': [(0, 0), (0, 5), (1, 5), (1, 3), (1, 3), (1, 2), (1, 3), (3, 5), (5, 8), (1, 1), (0, 1), (1, 2), (1, 2), (0, 0)]}

contradicting what I was expecting from above; I guess in this case, first the space that is implicitly added before üù gets the offset (1, 1), and then the letter ü (which doesn’t have any space before it anymore) correctly gets the offset (0, 1), but then somehow the offset for ù (which should indeed be (1, 2), being the second half of a two-character word) gets… doubled?

Edit: the weirdness in the last part seems to be explained by how the tokenizer handles Unicode characters, encoding them byte by byte. So ü is actually encoded as the two byte-level pieces Ã and ¼, and ù as Ã and ¹, as shown by running tokenizer.convert_ids_to_tokens on the output. It all becomes a bit cleaner when running tokenizer(' '.join(words)) and looking at its result, where the tokens are indeed duplicated, but the offsets are correctly duplicated as well:

{'input_ids': [0, 20760, 232, 48769, 2003, 2107, 32490, 3248, 33467, 952, 4394, 3849, 9253, 2],
'offset_mapping': [(0, 0), (0, 5), (6, 11), (12, 15), (16, 19), (20, 22), (23, 26), (26, 28), (28, 31), (32, 33), (32, 33), (33, 34), (33, 34), (0, 0)]}

I guess that for now I’ll stick to using tokenizer(' '.join(words)) instead of tokenizer(words, is_split_into_words=True).
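In case it is useful to anyone, this is roughly what that looks like for me (a minimal sketch; token_to_word is just an illustrative helper, and it assumes the words themselves contain no spaces):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilroberta-base', add_prefix_space=True)
words = ['hello', 'world', 'foo', 'bar', '32', '31290485', 'üù']

# Character span of each word inside the joined text.
text = ' '.join(words)
spans, pos = [], 0
for word in words:
    spans.append((pos, pos + len(word)))
    pos += len(word) + 1  # +1 for the joining space

encoding = tokenizer(text, return_offsets_mapping=True)

def token_to_word(offset, spans):
    # Map a token's (start, end) offset to the index of the word containing it.
    # Special tokens come back as (0, 0) and are mapped to None.
    if offset == (0, 0):
        return None
    for i, (start, end) in enumerate(spans):
        if start <= offset[0] < end:
            return i
    return None

print([token_to_word(tuple(o), spans) for o in encoding['offset_mapping']])
# With the output above: [None, 0, 1, 2, 3, 4, 5, 5, 5, 6, 6, 6, 6, None]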
