Hi everyone!
I hope I am not asking something that has already been answered elsewhere.
I am currently implementing a QA model that should support multiple tokenizers and models. However, I am a bit puzzled by the behavior of offset_mapping and char_to_token across different tokenizers.
Consider the following code:
from functools import lru_cache

from transformers import AutoTokenizer


@lru_cache(maxsize=128)
def get_tokenizer(t, add_prefix_space):
    # Cache tokenizers so repeated calls with the same name are cheap.
    return AutoTokenizer.from_pretrained(t, add_prefix_space=add_prefix_space, use_fast=True)


def f(question, context, position, tokenizer, add_prefix_space):
    t = get_tokenizer(tokenizer, add_prefix_space)
    to = t(question, context, return_offsets_mapping=True)
    # Map a char position in the context (sequence_index=1) to its token,
    # then back to a char position via that token's offset span.
    print(f'Original position: {position}')
    print(f'Back-mapped position: {to["offset_mapping"][to.char_to_token(position, sequence_index=1)][0]}')
    print()
    for _id, _c in zip(to["input_ids"], to["offset_mapping"]):
        print(f'\t\'{t.decode([_id])}\'\t{_c}')
If I use something like bert-base-cased, everything works as expected and I get:
f('Where is Italy?', 'In Italy', 3, 'bert-base-cased', add_prefix_space=True)
Original position: 3
Back-mapped position: 3
'[CLS]' (0, 0)
'Where' (0, 5)
'is' (6, 8)
'Italy' (9, 14)
'?' (14, 15)
'[SEP]' (0, 0)
'In' (0, 2)
'Italy' (3, 8)
'[SEP]' (0, 0)
However, if I switch to microsoft/deberta-large:
f('Where is Italy?', 'In Italy', 3, 'microsoft/deberta-large', add_prefix_space=True)
Original position: 3
Back-mapped position: 2
'[CLS]' (0, 0)
' Where' (0, 5)
' is' (5, 8)
' Italy' (8, 14)
'?' (14, 15)
'[SEP]' (0, 0)
' In' (0, 2)
' Italy' (2, 8)
'[SEP]' (0, 0)
Here the back-mapping fails (which is a problem at inference time): char_to_token(3, sequence_index=1) resolves to the token ' Italy', whose span (2, 8) starts at the leading space, so the back-mapped position becomes 2 instead of 3.
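A quick check confirms that the span really does include the space (assuming the offsets index into the raw context string):

t = get_tokenizer('microsoft/deberta-large', True)
enc = t('Where is Italy?', 'In Italy', return_offsets_mapping=True)
start, end = enc["offset_mapping"][enc.char_to_token(3, sequence_index=1)]
print(repr('In Italy'[start:end]))  # prints ' Italy' -- leading space included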
Furthermore, if I switch to a similar tokenizer such as BartTokenizer (if I am not mistaken, both DebertaTokenizer and BartTokenizer inherit from GPT2Tokenizer), things change further:

f('Where is Italy?', 'In Italy', 3, 'facebook/bart-large', add_prefix_space=True)
Original position: 3
Back-mapped position: 3
'<s>' (0, 0)
' Where' (1, 5)
' is' (6, 8)
' Italy' (9, 14)
'?' (14, 15)
'</s>' (0, 0)
'</s>' (0, 0)
' In' (1, 2)
' Italy' (3, 8)
'</s>' (0, 0)
Here the back-mapping works. However, a new problem arises: the token “In” is now incorrectly mapped to (1, 2) instead of (0, 2). I guess this happens because of add_prefix_space (indeed, setting it to False maps “In” back to (0, 2)): the prepended space seems to shift the start of the first token of each sequence by one (note ' Where' at (1, 5) as well). This looks like a bug to me, and had the position been 0, the mapping would have failed outright, since char_to_token(0, sequence_index=1) would return None.
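For now, I am working around the DeBERTa behavior by normalizing the mapped span against the raw context. A minimal sketch (back_map is just a hypothetical helper name, and it assumes the offsets index into context, so it does not fix the BART off-by-one on the first token):

def back_map(question, context, position, tokenizer, add_prefix_space):
    # Hypothetical helper: char position -> token -> char position,
    # skipping any leading whitespace folded into the token's span.
    t = get_tokenizer(tokenizer, add_prefix_space)
    enc = t(question, context, return_offsets_mapping=True)
    tok_idx = enc.char_to_token(position, sequence_index=1)
    if tok_idx is None:  # no token span covers this char (e.g. BART at position 0)
        return None
    start, end = enc["offset_mapping"][tok_idx]
    # Assumption: offsets index into `context`; GPT-2-style BPE tokenizers
    # such as DeBERTa's include the leading space in the span, so skip it.
    while start < end and context[start].isspace():
        start += 1
    return start

With this, all three tokenizers above back-map position 3 to 3, but it feels like I am patching over behavior I do not fully understand.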
So, what I am wondering is whether I am missing something in how the mapping between chars and tokens is performed, or whether these different behaviors are expected.
Thank you!