Issues with offset_mapping values

Hi guys, I am trying to work with a FairSeq model converted to :hugs: but I have some issues with the tokenizer. I am trying to fine-tune it for POS tagging, so the text is already split into words, and I want to use the offset_mapping to detect the first token of each word. I do it like this:

from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained('path', add_prefix_space=True)
ids = tokenizer([['drieme', 'drieme'], ['drieme']],
    is_split_into_words=True,
    padding=True,
    return_offsets_mapping=True)

The tokenization of the first sentence looks like this:

['<s>', 'Ġd', 'rieme', 'Ġd', 'rieme', '</s>']

But the output of the call looks like this:

{
  'input_ids': [[0, 543, 24209, 543, 24209, 2], [0, 543, 24209, 2, 1, 1]],
  'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0]],
  'offset_mapping': [
    [(0, 0), (0, 1), (1, 6), (1, 1), (1, 6), (0, 0)],
    [(0, 0), (0, 1), (1, 6), (0, 0), (0, 0), (0, 0)]
  ]
}

Notice the offset mapping for the word drieme in the first sentence. The first occurrence has mappings (0, 1) and (1, 6), which looks reasonable; however, the second drieme has (1, 1) and (1, 6). Suddenly there is a 1 at the first position, and this 1 shows up for all but the first word of any sentence I try to parse. I suspect it has something to do with handling the start of the sentence differently from the other words, but I am not sure how to get proper offset mappings. What am I doing wrong?
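For context, the first-token detection I had in mind is roughly this (just a sketch, and first_token_mask is my own name; it relies on word-relative offsets starting at 0 for word-initial tokens, with special tokens reported as (0, 0)):

# With is_split_into_words=True the offsets are relative to each word, so a
# word-initial token should have start == 0; special tokens like <s>, </s>
# and <pad> are reported as (0, 0) and must be excluded via end != 0.
for offsets in ids['offset_mapping']:
    first_token_mask = [start == 0 and end != 0 for start, end in offsets]
    print(first_token_mask)

# Expected for the first sentence: [False, True, False, True, False, False]
# Actual, because of the (1, 1) offset: [False, True, False, False, False, False]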

Thanks for reporting, it’s definitely a bug. Could you open an issue on tokenizers with your snippet?

I'm having a similar issue as well. Is there a proper fix yet?

Don’t know about a fix, but I worked around it by detecting the special character that marks the start of each word. In my case this was Ġ. I used that instead of the offset mapping to detect the start of each word.
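Roughly like this, using the snippet from the original post (a sketch; it assumes a byte-level BPE tokenizer where Ġ encodes a leading space, which add_prefix_space=True guarantees for every word):

# Convert ids back to token strings and mark tokens that begin with Ġ, the
# byte-level BPE marker for a leading space. With add_prefix_space=True every
# word starts with one, so these are exactly the word-initial tokens.
for seq in ids['input_ids']:
    tokens = tokenizer.convert_ids_to_tokens(seq)
    word_starts = [tok.startswith('Ġ') for tok in tokens]
    print(word_starts)

# First sentence: ['<s>', 'Ġd', 'rieme', 'Ġd', 'rieme', '</s>']
# -> [False, True, False, True, False, False]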