Bug in Offset generation for Rupee symbol

Giriteja · June 27, 2022, 7:32am

Hi, when I use the rupee symbol (₹) in a sentence, the tokenizer splits it into 3 tokens. Instead of giving them the offsets (0,1)(1,2)(2,3), it gives (0,1)(0,1)(0,1), which causes a mismatch between the actual words and the generated labels. For example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('distilroberta-base', add_prefix_space=True)

sent = "total amount that need to be paid is ₹ 500"
words = sent.split()

# Tokenize the pre-split words and request character offsets for each token
output = tokenizer(words, is_split_into_words=True, return_offsets_mapping=True)

tokens = output.tokens()
offsets = output['offset_mapping']

for token, offset in zip(tokens, offsets):
    print(token, "----->", offset)

I am getting the following output:
<s> -----> (0, 0)
Ġtotal -----> (0, 5)
Ġamount -----> (0, 6)
Ġthat -----> (0, 4)
Ġneed -----> (0, 4)
Ġto -----> (0, 2)
Ġbe -----> (0, 2)
Ġpaid -----> (0, 4)
Ġis -----> (0, 2)
Ġâ -----> (0, 1)  # problem
Ĥ -----> (0, 1)  # problem
¹ -----> (0, 1)  # problem
Ġ500 -----> (0, 3)
</s> -----> (0, 0)
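
A possible explanation for the three tokens, offered as a sketch rather than a confirmed diagnosis: RoBERTa-style tokenizers use byte-level BPE, and ₹ (U+20B9) occupies three bytes in UTF-8. Each byte is shown as a stand-in printable character, which would account for â, Ĥ and ¹. The function below rebuilds the GPT-2 style byte-to-character table in plain Python and does not call into transformers. The further step, that all three byte tokens are reported against the one source character and so each gets (0, 1), is an assumption here.

```python
def bytes_to_unicode():
    """Rebuild the GPT-2 style table mapping each byte (0-255) to a printable character."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # Every remaining byte is assigned the next code point from 256 upward.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


table = bytes_to_unicode()
rupee_bytes = "₹".encode("utf-8")  # one character, several bytes
rupee_display = "".join(table[b] for b in rupee_bytes)
space_display = table[ord(" ")]  # how a leading space is displayed

print(len("₹"), len(rupee_bytes))      # character count vs. byte count
print(space_display + rupee_display)   # stand-in characters for " ₹"
```

If that reading is right, the offsets count characters of the original word, not bytes, so a single-character word cannot yield (1, 2) or (2, 3).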

As you can see above, the rupee symbol is split into 3 different tokens, but the offset is still (0, 1) for all three.
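
One way to keep word-level labels aligned without relying on the offsets is to group tokens by word index. The sketch below assumes the indices come from `output.word_ids()`, which fast tokenizers provide. They are hardcoded here to match the 14 tokens printed above, so the snippet runs without downloading a model. The helper name `align_labels`, the label strings, and the use of -100 for ignored positions are illustrative assumptions, not part of the original code.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each token the label of its source word (hypothetical helper).

    Special tokens (word id None) and continuation pieces of a word
    receive ignore_index, so only the first piece of each word is labeled.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)          # <s>, </s>
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first piece of a word
        else:
            aligned.append(ignore_index)          # later pieces of the same word
        previous = word_id
    return aligned


words = "total amount that need to be paid is ₹ 500".split()

# Made-up labels, one per word: "CUR" for the symbol, "AMT" for the number.
word_labels = ["O"] * 8 + ["CUR", "AMT"]

# Stand-in for output.word_ids(): the three pieces of the rupee symbol share word 8.
word_ids = [None, 0, 1, 2, 3, 4, 5, 6, 7, 8, 8, 8, 9, None]

token_labels = align_labels(word_ids, word_labels)
for word_id, label in zip(word_ids, token_labels):
    print(word_id, "----->", label)
```

Because the grouping uses word indices, the three pieces of ₹ stay attached to one word even though their offsets are identical.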
