Hi guys, I am trying to work with a FairSeq model converted to Hugging Face Transformers, but I have some issues with the tokenizer. I am trying to fine-tune it for POS tagging, so the text is already split into words and I want to use the offset_mapping to detect the first token of each word. I do it like this:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained('path', add_prefix_space=True)
ids = tokenizer([['drieme', 'drieme'], ['drieme']],
                is_split_into_words=True,
                padding=True,
                return_offsets_mapping=True)
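For context, this is roughly the heuristic I planned to use to find the first sub-token of each word: a token starts a word if its offset begins at 0. This is just a sketch of my intent, not working code; it assumes special and padding tokens come back as (0, 0), so I filter those out with the end != 0 check:

# Sketch of my first-token heuristic over the batch above.
for seq_offsets in ids['offset_mapping']:
    first_token_mask = [start == 0 and end != 0
                        for start, end in seq_offsets]
    print(first_token_mask)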
The tokenization looks like this:
['<s>', 'Ġd', 'rieme', 'Ġd', 'rieme', '</s>']
And the full output of the call looks like this:
{
    'input_ids': [[0, 543, 24209, 543, 24209, 2], [0, 543, 24209, 2, 1, 1]],
    'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0]],
    'offset_mapping': [
        [(0, 0), (0, 1), (1, 6), (1, 1), (1, 6), (0, 0)],
        [(0, 0), (0, 1), (1, 6), (0, 0), (0, 0), (0, 0)]
    ]
}
Notice the offset mapping for the word drieme in the first sentence. The first occurrence has the mappings (0, 1) and (1, 6), which looks reasonable; the second drieme, however, has (1, 1) and (1, 6). Suddenly there is a 1 at the first position, and this 1 shows up for every word except the first one in any sentence I try to parse. I feel like it might have something to do with how the start of the sentence is handled versus all the other words, but I am not sure how to solve this so that I get proper offset mappings. What am I doing wrong?
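In case anyone wants to reproduce this, a small inspection loop like the one below (reusing the ids variable from above) prints each token next to its offset, which makes the stray 1 easy to spot:

# Line tokens up with their offsets for inspection.
for seq_ids, seq_offsets in zip(ids['input_ids'], ids['offset_mapping']):
    for token, (start, end) in zip(tokenizer.convert_ids_to_tokens(seq_ids),
                                   seq_offsets):
        # Special and padding tokens come back as (0, 0).
        print(f'{token!r:>10} -> ({start}, {end})')
    print()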