Hello,
Is there any update on this topic? I am encountering the same issue with any tokenizer that has the add_prefix_space=True flag, which seems to be needed, for example, for RoBERTa tokenizers. Here are a few more examples to illustrate the issue, using words = ['hello', 'world', 'foo', 'bar', '32', '31290485']:
- Using a non-RoBERTa tokenizer doesn’t show the issue and works as expected:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
tokenizer(words, is_split_into_words=True, return_offsets_mapping=True)
# This returns:
# 'input_ids': [101, 7592, 2088, 29379, 3347, 3590, 21036, 21057, 18139, 2629, 102]
# 'offset_mapping': [(0, 0), (0, 5), (0, 5), (0, 3), (0, 3), (0, 2), (0, 3), (3, 5), (5, 7), (7, 8), (0, 0)]
# Accessing tokenizer.add_prefix_space here would raise an AttributeError
In the above, as expected, special tokens have offsets (0, 0), word beginnings are of the form (0, n), with n being the word/chunk length, and continuations of words are of the form (n, m), with n > 0 being the start of the chunk and m - n the chunk length.
- However, the RoBERTa tokenizer gives the following:
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base', add_prefix_space=True)
tokenizer(words, is_split_into_words=True, return_offsets_mapping=True)
# This returns
# 'input_ids': [0, 20760, 232, 48769, 2003, 2107, 32490, 3248, 33467, 2]
# 'offset_mapping': [(0, 0), (0, 5), (1, 5), (1, 3), (1, 3), (1, 2), (1, 3), (3, 5), (5, 8), (0, 0)]
tokenizer.add_prefix_space
# This returns True
The special tokens are still marked with offset (0, 0), the first word is still of the form (0, n), and continuation chunks are still of the form (n, m), but the beginning of every word after the first is now of the form (1, n), with n still being the word/chunk length. So this looks to be the only exception to the rule that the length of the chunk is m - n, which is not good. I suppose the leading 1 is there because, as the flag name suggests, a space is added before the words and is not taken into account when computing the offset_mapping, but as suggested above that looks like a bug (see the sketch right after this list for a possible normalization).
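In case it helps, below is a minimal sketch of that normalization, assuming the pattern above holds in general (i.e. only the start of the first chunk of each word is shifted by the prefix space). The normalized_offsets helper is just a name I made up for illustration, not a transformers API; it relies on word_ids(), which fast tokenizer encodings expose, to spot the first chunk of every word:

from transformers import AutoTokenizer

# Made-up helper (not a transformers API): reset the start of the first chunk
# of every word to 0, so that m - n is the chunk length again.
def normalized_offsets(encoding):
    offsets = []
    previous_word_id = None
    for (start, end), word_id in zip(encoding['offset_mapping'], encoding.word_ids()):
        if word_id is None:                   # special tokens keep (0, 0)
            offsets.append((start, end))
        elif word_id != previous_word_id:     # first chunk of a word: drop the prefix-space shift
            offsets.append((0, end))
        else:                                 # continuation chunks are already fine
            offsets.append((start, end))
        previous_word_id = word_id
    return offsets

words = ['hello', 'world', 'foo', 'bar', '32', '31290485']
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base', add_prefix_space=True)
encoding = tokenizer(words, is_split_into_words=True, return_offsets_mapping=True)
print(normalized_offsets(encoding))
# With the output quoted above, this should give the BERT-like offsets:
# [(0, 0), (0, 5), (0, 5), (0, 3), (0, 3), (0, 2), (0, 3), (3, 5), (5, 8), (0, 0)]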
For my purpose (running NER) I can probably work around the above by explicitly checking whether the word is the first word in the sentence, but the situation gets weirder when the chunks are just 1 character long (so that, with the above logic, the offset_mapping should then be something like (1, 1), (1, 2)). For example, using words = ['hello', 'world', 'foo', 'bar', '32', '31290485', 'üù'], I am getting:
{'input_ids': [0, 20760, 232, 48769, 2003, 2107, 32490, 3248, 33467, 952, 4394, 3849, 9253, 2],
'offset_mapping': [(0, 0), (0, 5), (1, 5), (1, 3), (1, 3), (1, 2), (1, 3), (3, 5), (5, 8), (1, 1), (0, 1), (1, 2), (1, 2), (0, 0)]}
contradicting what I was expecting from the above. I guess in this case the space that is implicitly added before üù first gets the offset (1, 1), then the letter ü (which doesn't have any space before it anymore) correctly gets the offset (0, 1), but then somehow the offset for ù (which should indeed be (1, 2), being the second half of a two-character word) gets… doubled?
Edit: the weirdness in the last part seems to be explained by how the tokenizer handles Unicode characters, encoding them as byte-level pieces. So ü is actually encoded as two pieces, and the same happens to ù, as can be seen by running tokenizer.convert_ids_to_tokens on the output.
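For reference, this is the kind of quick check that shows those pieces (the exact strings depend on the vocabulary and merges, so the comment below is only indicative):

from transformers import AutoTokenizer

words = ['hello', 'world', 'foo', 'bar', '32', '31290485', 'üù']
tokenizer = AutoTokenizer.from_pretrained('distilroberta-base', add_prefix_space=True)
encoding = tokenizer(words, is_split_into_words=True, return_offsets_mapping=True)
print(tokenizer.convert_ids_to_tokens(encoding['input_ids']))
# The last word does not come back as 'üù' itself but as byte-level pieces,
# mojibake-looking strings built from characters like 'Ã', '¼' and '¹'
# (how they get grouped into tokens depends on the learned merges).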
Things become a bit clearer by running tokenizer(' '.join(words), return_offsets_mapping=True) and looking at its result, where indeed the tokens are duplicated, but the offsets are also correctly duplicated:
{'input_ids': [0, 20760, 232, 48769, 2003, 2107, 32490, 3248, 33467, 952, 4394, 3849, 9253, 2],
'offset_mapping': [(0, 0), (0, 5), (6, 11), (12, 15), (16, 19), (20, 22), (23, 26), (26, 28), (28, 31), (32, 33), (32, 33), (33, 34), (33, 34), (0, 0)]}
I guess that for now I'll stick to using tokenizer(' '.join(words)) instead of tokenizer(words, is_split_into_words=True)…
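In case it is useful, here is a rough sketch of how I plan to map tokens back to word indices with the ' '.join route; word_spans and token_to_word are just illustrative names, and the only assumption is that each non-special token's offsets fall inside exactly one word span:

from transformers import AutoTokenizer

words = ['hello', 'world', 'foo', 'bar', '32', '31290485', 'üù']
text = ' '.join(words)

# character span of each word inside the joined text
word_spans = []
position = 0
for word in words:
    word_spans.append((position, position + len(word)))
    position += len(word) + 1  # +1 for the joining space

tokenizer = AutoTokenizer.from_pretrained('distilroberta-base', add_prefix_space=True)
encoding = tokenizer(text, return_offsets_mapping=True)

def token_to_word(start, end):
    # special tokens are reported as (0, 0)
    if start == end == 0:
        return None
    for index, (word_start, word_end) in enumerate(word_spans):
        if word_start <= start and end <= word_end:
            return index
    return None

print([token_to_word(start, end) for start, end in encoding['offset_mapping']])
# With the offsets quoted above, this should print:
# [None, 0, 1, 2, 3, 4, 5, 5, 5, 6, 6, 6, 6, None]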