Offset mappings differ for tokenizers

UWinch · October 30, 2023, 12:37pm

Hi,

I’m trying to map the offsets of subtokens back to the original text. In some cases they seem to be off by one, depending on the tokenizer and on whether special tokens are added. I would like to know whether there is a pattern to this behaviour, e.g. that whenever a ByteLevel encoding is used, the begin offset needs to be shifted by one to account for the begin-prefix (Ġ). My goal is to implement this mapping so that it works for every tokenizer. Is this possible? Are there rules that tokenizers follow when it comes to offsets?

For example running the following code:

from transformers import AutoTokenizer, BatchEncoding, PreTrainedTokenizer

model_name = "microsoft/mdeberta-v3-base"
text = "This is a great \n test."

tokenizer: PreTrainedTokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.add_tokens(["\n"], special_tokens=True)


encoding: BatchEncoding = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))

# map back to original text using the offsets
tokens = [text[begin:end] for begin, end in encoding.offset_mapping]
print(tokens)

With mDeBERTa, this results in
['▁This', '▁is', '▁', 'a', '▁great', '\n', '▁test', '.']
['This', ' is', ' ', 'a', ' great', '\n', 'test', '.']

and with GPT-2, I get
['This', 'Ġis', 'Ġa', 'Ġgreat', 'Ġ', '\n', 'Ġtest', '.']
['This', ' is', ' a', ' great', ' ', '\n', ' test', '.']

For GPT-2, I could correct the offsets by adding one to the begin offset whenever the subtoken starts with the begin-prefix. However, this does not work for the DeBERTa subtokens: for example, the span for '▁test', the token right after the special token, already excludes the leading space, while the spans for '▁is' and '▁great' include it.
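To make the kind of correction concrete, here is a minimal sketch that looks at the original text instead of the token prefix: it moves the begin offset past leading whitespace, but leaves spans that consist only of whitespace untouched. The helper name `trim_leading_whitespace` is hypothetical, and the offset lists are reconstructed by hand from the two outputs above (an assumption, not values printed by the tokenizers). It handles both examples here, but I have not verified that it holds for every tokenizer.

```python
from typing import List, Tuple

Offset = Tuple[int, int]


def trim_leading_whitespace(text: str, offsets: List[Offset]) -> List[Offset]:
    """Shift each begin offset past leading whitespace in the original text.

    Spans that are entirely whitespace (e.g. ' ' or '\n') are kept unchanged,
    so whitespace-only tokens do not collapse to empty spans.
    """
    trimmed = []
    for begin, end in offsets:
        span = text[begin:end]
        if span.strip():  # span contains at least one non-whitespace character
            begin += len(span) - len(span.lstrip())
        trimmed.append((begin, end))
    return trimmed


text = "This is a great \n test."

# Offsets reconstructed by hand from the printed slices above (assumption).
gpt2_offsets = [(0, 4), (4, 7), (7, 9), (9, 15), (15, 16), (16, 17), (17, 22), (22, 23)]
deberta_offsets = [(0, 4), (4, 7), (7, 8), (8, 9), (9, 15), (16, 17), (18, 22), (22, 23)]

for name, offsets in [("gpt2", gpt2_offsets), ("mdeberta", deberta_offsets)]:
    fixed = trim_leading_whitespace(text, offsets)
    print(name, [text[b:e] for b, e in fixed])
```

Because the check runs on the text slice and not on the token string, it does not depend on whether the tokenizer marks word starts with Ġ or ▁.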

Thanks for any replies.
