I'm trying to map the offsets of subtokens back to the original text. In some cases they seem to be off by one, depending on the tokenizer and on whether special tokens are added. I would like to know whether this behaviour follows a pattern, e.g. whether the offsets always need to be shifted by one when a ByteLevel encoding is used, to account for the begin-prefix (Ġ). My goal is to implement this mapping so that it works for every tokenizer. Is this possible? Are there rules that tokenizers follow when it comes to offsets?
For example, running the following code:
from transformers import AutoTokenizer, BatchEncoding, PreTrainedTokenizer

model_name = "microsoft/mdeberta-v3-base"
text = "This is a great \n test."

tokenizer: PreTrainedTokenizer = AutoTokenizer.from_pretrained(model_name)
# register the newline as a special token so it is kept as its own subtoken
tokenizer.add_tokens(["\n"], special_tokens=True)
encoding: BatchEncoding = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))

# map back to original text using the offsets
tokens = [text[begin:end] for begin, end in encoding.offset_mapping]
print(tokens)
with deberta results in
['▁This', '▁is', '▁', 'a', '▁great', '\n', '▁test', '.']
['This', ' is', ' ', 'a', ' great', '\n', 'test', '.']
and with gpt-2, I get
['This', 'Ġis', 'Ġa', 'Ġgreat', 'Ġ', '\n', 'Ġtest', '.']
['This', ' is', ' a', ' great', ' ', '\n', ' test', '.'].
For gpt-2, I could correct the offsets by adding one to the begin offset whenever the subtoken starts with the begin-prefix (Ġ). However, this does not work for the deberta subtokens: '▁great' maps to ' great' with the leading space included, but '▁test', which follows the special token, maps to 'test' without it.
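One workaround I could imagine, which does not depend on the prefix character at all, is to look at the original text inside each span and trim surrounding whitespace there. The following is only a sketch: `trim_offsets` is a hypothetical helper of my own (not part of transformers), and the two offset lists are written by hand to match the spans printed above rather than copied from the tokenizers. It also assumes that whitespace-only spans (such as the special `\n` token) should be left untouched.

```python
from typing import List, Tuple

Offset = Tuple[int, int]


def trim_offsets(text: str, offsets: List[Offset]) -> List[Offset]:
    """Shrink each (begin, end) span so it excludes surrounding whitespace.

    Spans that consist only of whitespace are returned unchanged.
    """
    trimmed = []
    for begin, end in offsets:
        span = text[begin:end]
        core = span.strip()
        if core:
            # skip leading whitespace, then keep only the non-whitespace core
            begin += len(span) - len(span.lstrip())
            end = begin + len(core)
        trimmed.append((begin, end))
    return trimmed


text = "This is a great \n test."

# Hand-written offsets that reproduce the spans printed above (assumption).
gpt2_offsets = [(0, 4), (4, 7), (7, 9), (9, 15), (15, 16), (16, 17), (17, 22), (22, 23)]
deberta_offsets = [(0, 4), (4, 7), (7, 8), (8, 9), (9, 15), (16, 17), (18, 22), (22, 23)]

gpt2_spans = [text[b:e] for b, e in trim_offsets(text, gpt2_offsets)]
deberta_spans = [text[b:e] for b, e in trim_offsets(text, deberta_offsets)]
print(gpt2_spans)
print(deberta_spans)
```

With this, both tokenizers would give the same begin offset for "is", "great" and "test", but I don't know whether it holds for every tokenizer.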
Thanks for any replies.