Hello,
Is there any update on this topic? I am encountering the same issue with any tokenizer that has the add_prefix_space=True flag, which seems to be needed, for example, for RoBERTa tokenizers. Here are a few more examples to illustrate the issue, using words = ['hello', 'world', 'foo', 'bar', '32', '31290485']:
- Using a non-RoBERTa tokenizer doesn’t show the issue and works as expected:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer(words, is_split_into_words=True, return_offsets_mapping=True)
# This returns:
# 'input_ids': [101, 7592, 2088, 29379, 3347, 3590, 21036, 21057, 18139, 2629, 102]
# 'offset_mapping': [(0, 0), (0, 5), (0, 5), (0, 3), (0, 3), (0, 2), (0, 3), (3, 5), (5, 7), (7, 8), (0, 0)]
# Accessing tokenizer.add_prefix_space here would raise an AttributeError
In the above, as expected, special tokens have offsets (0, 0), word beginnings are of the form (0, n), with n being the word/chunk length, and continuations of words are of the form (n, m), with n > 0 being the start of the chunk and m - n the chunk length.
- However, the RoBERTa tokenizer gives the following:
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base', add_prefix_space=True)
tokenizer(words, is_split_into_words=True, return_offsets_mapping=True)
# This returns
# 'input_ids': [0, 20760, 232, 48769, 2003, 2107, 32490, 3248, 33467, 2]
# 'offset_mapping': [(0, 0), (0, 5), (1, 5), (1, 3), (1, 3), (1, 2), (1, 3), (3, 5), (5, 8), (0, 0)]
tokenizer.add_prefix_space
# This returns True
The special tokens are still marked with offset (0, 0), the first word is still of the form (0, n), and continuation chunks are still of the form (n, m), but the beginning of every word after the first is now of the form (1, n), with n still being the word/chunk length. So this looks to be the only exception to the rule that the length of the chunk is m - n, which is not good. I suppose the leading 1 is there because, as the flag name suggests, a space is added before the words and is not taken into account when computing the offset_mapping, but as suggested above that looks like a bug (see the sketch right after this list for a possible normalization).
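In case it helps, below is a minimal sketch of that normalization, assuming the pattern above holds in general (i.e. only the start of the first chunk of each word is shifted by the prefix space). The normalized_offsets helper is just a name I made up for illustration, not a transformers API; it relies on word_ids(), which fast tokenizer encodings expose, to spot the first chunk of every word:

from transformers import AutoTokenizer

# Made-up helper (not a transformers API): reset the start of the first chunk
# of every word to 0, so that m - n is the chunk length again.
def normalized_offsets(encoding):
    offsets = []
    previous_word_id = None
    for (start, end), word_id in zip(encoding['offset_mapping'], encoding.word_ids()):
        if word_id is None:                   # special tokens keep (0, 0)
            offsets.append((start, end))
        elif word_id != previous_word_id:     # first chunk of a word: drop the prefix-space shift
            offsets.append((0, end))
        else:                                 # continuation chunks are already fine
            offsets.append((start, end))
        previous_word_id = word_id
    return offsets

words = ['hello', 'world', 'foo', 'bar', '32', '31290485']
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base', add_prefix_space=True)
encoding = tokenizer(words, is_split_into_words=True, return_offsets_mapping=True)
print(normalized_offsets(encoding))
# With the output quoted above, this should give the BERT-like offsets:
# [(0, 0), (0, 5), (0, 5), (0, 3), (0, 3), (0, 2), (0, 3), (3, 5), (5, 8), (0, 0)]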
For my purpose (running NER) I can probably work around the above by explicitly checking whether the word is the first word in the sentence, but the situation gets weirder when the chunks are just 1 character long (so that, with the above logic, the offset_mapping should then be something like (1, 1), (1, 2)). For example, using words = ['hello', 'world', 'foo', 'bar', '32', '31290485', 'üù'], I am getting:
{'input_ids': [0, 20760, 232, 48769, 2003, 2107, 32490, 3248, 33467, 952, 4394, 3849, 9253, 2],
'offset_mapping': [(0, 0), (0, 5), (1, 5), (1, 3), (1, 3), (1, 2), (1, 3), (3, 5), (5, 8), (1, 1), (0, 1), (1, 2), (1, 2), (0, 0)]}
contradicting what I was expecting from the above. I guess in this case the space that is implicitly added before üù first gets the offset (1, 1), then the letter ü (which doesn't have any space before it anymore) correctly gets the offset (0, 1), but then somehow the offset for ù (which should indeed be (1, 2), being the second half of a two-character word) gets… doubled?
Edit: the weirdness in the last part seems to be explained by how the tokenizer handles Unicode characters, encoding them as byte-level pieces. So ü is actually encoded as two pieces, and the same happens to ù, as can be seen by running tokenizer.convert_ids_to_tokens on the output.
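For reference, this is the kind of quick check that shows those pieces (the exact strings depend on the vocabulary and merges, so the comment below is only indicative):

from transformers import AutoTokenizer

words = ['hello', 'world', 'foo', 'bar', '32', '31290485', 'üù']
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base', add_prefix_space=True)
encoding = tokenizer(words, is_split_into_words=True, return_offsets_mapping=True)
print(tokenizer.convert_ids_to_tokens(encoding['input_ids']))
# The last word does not come back as 'üù' itself but as byte-level pieces,
# mojibake-looking strings built from characters like 'Ã', '¼' and '¹'
# (how they get grouped into tokens depends on the learned merges).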
Things become a bit clearer by running tokenizer(' '.join(words), return_offsets_mapping=True) and looking at its result, where indeed the tokens are duplicated, but the offsets are also correctly duplicated:
{'input_ids': [0, 20760, 232, 48769, 2003, 2107, 32490, 3248, 33467, 952, 4394, 3849, 9253, 2],
'offset_mapping': [(0, 0), (0, 5), (6, 11), (12, 15), (16, 19), (20, 22), (23, 26), (26, 28), (28, 31), (32, 33), (32, 33), (33, 34), (33, 34), (0, 0)]}
I guess that for now I'll stick to using tokenizer(' '.join(words)) instead of tokenizer(words, is_split_into_words=True)…
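In case it is useful, here is a rough sketch of how I plan to map tokens back to word indices with the ' '.join route; word_spans and token_to_word are just illustrative names, and the only assumption is that each non-special token's offsets fall inside exactly one word span:

from transformers import AutoTokenizer

words = ['hello', 'world', 'foo', 'bar', '32', '31290485', 'üù']
text = ' '.join(words)

# character span of each word inside the joined text
word_spans = []
position = 0
for word in words:
    word_spans.append((position, position + len(word)))
    position += len(word) + 1  # +1 for the joining space

tokenizer = AutoTokenizer.from_pretrained('distilroberta-base', add_prefix_space=True)
encoding = tokenizer(text, return_offsets_mapping=True)

def token_to_word(start, end):
    # special tokens are reported as (0, 0)
    if start == end == 0:
        return None
    for index, (word_start, word_end) in enumerate(word_spans):
        if word_start <= start and end <= word_end:
            return index
    return None

print([token_to_word(start, end) for start, end in encoding['offset_mapping']])
# With the offsets quoted above, this should print:
# [None, 0, 1, 2, 3, 4, 5, 5, 5, 6, 6, 6, 6, None]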