Hi everyone!
I hope I am not asking something that has already been answered elsewhere.
I am currently implementing a QA model that should support multiple tokenizers and models. However, I am a bit puzzled by the behavior of offset_mapping and char_to_token across different tokenizers.
Consider the following code:
from functools import lru_cache

from transformers import AutoTokenizer


@lru_cache(maxsize=128)
def get_tokenizer(t, add_prefix_space):
    # Cache tokenizers so repeated calls with the same name are cheap.
    return AutoTokenizer.from_pretrained(t, add_prefix_space=add_prefix_space, use_fast=True)


def f(question, context, position, tokenizer, add_prefix_space):
    t = get_tokenizer(tokenizer, add_prefix_space)
    to = t(question, context, return_offsets_mapping=True)
    # Map a char position in the context (sequence_index=1) to its token,
    # then back to a char position via that token's offset span.
    print(f'Original position: {position}')
    print(f'Back-mapped position: {to["offset_mapping"][to.char_to_token(position, sequence_index=1)][0]}')
    print()
    for _id, _c in zip(to["input_ids"], to["offset_mapping"]):
        print(f'\t\'{t.decode([_id])}\'\t{_c}')
If I use something like bert-base-cased, everything works as expected and I get:
f('Where is Italy?', 'In Italy', 3, 'bert-base-cased', add_prefix_space=True)
Original position: 3
Back-mapped position: 3
'[CLS]' (0, 0)
'Where' (0, 5)
'is' (6, 8)
'Italy' (9, 14)
'?' (14, 15)
'[SEP]' (0, 0)
'In' (0, 2)
'Italy' (3, 8)
'[SEP]' (0, 0)
However, if I switch to microsoft/deberta-large:
f('Where is Italy?', 'In Italy', 3, 'microsoft/deberta-large', add_prefix_space=True)
Original position: 3
Back-mapped position: 2
'[CLS]' (0, 0)
' Where' (0, 5)
' is' (5, 8)
' Italy' (8, 14)
'?' (14, 15)
'[SEP]' (0, 0)
' In' (0, 2)
' Italy' (2, 8)
'[SEP]' (0, 0)
Here the back-mapping fails (which is a problem at inference time): char_to_token(3, sequence_index=1) resolves to the token ' Italy', whose span (2, 8) starts at the leading space, so the back-mapped position becomes 2 instead of 3.
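A quick check confirms that the span really does include the space (assuming the offsets index into the raw context string):

t = get_tokenizer('microsoft/deberta-large', True)
enc = t('Where is Italy?', 'In Italy', return_offsets_mapping=True)
start, end = enc["offset_mapping"][enc.char_to_token(3, sequence_index=1)]
print(repr('In Italy'[start:end]))  # prints ' Italy' -- leading space included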
Furthermore, if I switch to a similar tokenizer such as BartTokenizer (if I am not mistaken, both DebertaTokenizer and BartTokenizer inherit from GPT2Tokenizer), things change further:

f('Where is Italy?', 'In Italy', 3, 'facebook/bart-large', add_prefix_space=True)
Original position: 3
Back-mapped position: 3
'<s>' (0, 0)
' Where' (1, 5)
' is' (6, 8)
' Italy' (9, 14)
'?' (14, 15)
'</s>' (0, 0)
'</s>' (0, 0)
' In' (1, 2)
' Italy' (3, 8)
'</s>' (0, 0)
Here the back-mapping works. However, a new problem arises: the token “In” is now incorrectly mapped to (1, 2) instead of (0, 2). I guess this happens because of add_prefix_space (indeed, setting it to False maps “In” back to (0, 2)): the prepended space seems to shift the start of the first token of each sequence by one (note ' Where' at (1, 5) as well). This looks like a bug to me, and had the position been 0, the mapping would have failed outright, since char_to_token(0, sequence_index=1) would return None.
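For now, I am working around the DeBERTa behavior by normalizing the mapped span against the raw context. A minimal sketch (back_map is just a hypothetical helper name, and it assumes the offsets index into context, so it does not fix the BART off-by-one on the first token):

def back_map(question, context, position, tokenizer, add_prefix_space):
    # Hypothetical helper: char position -> token -> char position,
    # skipping any leading whitespace folded into the token's span.
    t = get_tokenizer(tokenizer, add_prefix_space)
    enc = t(question, context, return_offsets_mapping=True)
    tok_idx = enc.char_to_token(position, sequence_index=1)
    if tok_idx is None:  # no token span covers this char (e.g. BART at position 0)
        return None
    start, end = enc["offset_mapping"][tok_idx]
    # Assumption: offsets index into `context`; GPT-2-style BPE tokenizers
    # such as DeBERTa's include the leading space in the span, so skip it.
    while start < end and context[start].isspace():
        start += 1
    return start

With this, all three tokenizers above back-map position 3 to 3, but it feels like I am patching over behavior I do not fully understand.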
So, what I am wondering is whether I am missing something in how the mapping between chars and tokens is performed, or whether these different behaviors are expected.
Thank you!