Hi guys, I am trying to work with a FairSeq model converted to Hugging Face Transformers, but I have some issues with the tokenizer. I am trying to fine-tune it for POS tagging, so the text is already split into words and I want to use the offset_mapping to detect the first token of each word. I do it like this:
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained('path', add_prefix_space=True)
ids = tokenizer([['drieme', 'drieme'], ['drieme']],
                is_split_into_words=True,
                padding=True,
                return_offsets_mapping=True)
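For context, this is roughly the heuristic I planned to use to find the first sub-token of each word: a token starts a word if its offset begins at 0. This is just a sketch of my intent, not working code; it assumes special and padding tokens come back as (0, 0), so I filter those out with the end != 0 check:

# Sketch of my first-token heuristic over the batch above.
for seq_offsets in ids['offset_mapping']:
    first_token_mask = [start == 0 and end != 0
                        for start, end in seq_offsets]
    print(first_token_mask)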
The tokenization looks like this:
['<s>', 'Ġd', 'rieme', 'Ġd', 'rieme', '</s>']
And the full output of the call looks like this:
{
    'input_ids': [[0, 543, 24209, 543, 24209, 2], [0, 543, 24209, 2, 1, 1]],
    'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 0, 0]],
    'offset_mapping': [
        [(0, 0), (0, 1), (1, 6), (1, 1), (1, 6), (0, 0)],
        [(0, 0), (0, 1), (1, 6), (0, 0), (0, 0), (0, 0)]
    ]
}
Notice the offset mapping for the word drieme in the first sentence. The first occurrence has the mappings (0, 1) and (1, 6), which looks reasonable; the second drieme, however, has (1, 1) and (1, 6). Suddenly there is a 1 at the first position, and this 1 shows up for every word except the first one in any sentence I try to parse. I feel like it might have something to do with how the start of the sentence is handled versus all the other words, but I am not sure how to solve this so that I get proper offset mappings. What am I doing wrong?
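In case anyone wants to reproduce this, a small inspection loop like the one below (reusing the ids variable from above) prints each token next to its offset, which makes the stray 1 easy to spot:

# Line tokens up with their offsets for inspection.
for seq_ids, seq_offsets in zip(ids['input_ids'], ids['offset_mapping']):
    for token, (start, end) in zip(tokenizer.convert_ids_to_tokens(seq_ids),
                                   seq_offsets):
        # Special and padding tokens come back as (0, 0).
        print(f'{token!r:>10} -> ({start}, {end})')
    print()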