Hi, when I use the rupee symbol (₹) in a sentence, the tokenizer splits it into 3 tokens, but instead of the offsets being (0, 1), (1, 2), (2, 3), I get (0, 1), (0, 1), (0, 1). This causes a mismatch between the actual words and the generated labels. For example:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base", add_prefix_space=True)
sent = "total amount that need to be paid is ₹ 500"
words = sent.split()
output = tokenizer(words, is_split_into_words=True, return_offsets_mapping=True)
tokens = output.tokens()
offsets = output["offset_mapping"]
# Print each token next to its (start, end) offset
for token, offset in zip(tokens, offsets):
    print(token, "----->", offset)
I am getting the following output:
<s> -----> (0, 0)
Ġtotal -----> (0, 5)
Ġamount -----> (0, 6)
Ġthat -----> (0, 4)
Ġneed -----> (0, 4)
Ġto -----> (0, 2)
Ġbe -----> (0, 2)
Ġpaid -----> (0, 4)
Ġis -----> (0, 2)
Ġâ -----> (0, 1) #problem
Ĥ -----> (0, 1) #problem
¹ -----> (0, 1) #problem
Ġ500 -----> (0, 3)
</s> -----> (0, 0)
As you can see above, the rupee symbol is split into 3 different tokens, but the offset is still (0, 1) for all three of them.
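A possible explanation, sketched in plain Python without the library: this assumes that distilroberta-base uses GPT-2 style byte-level BPE, where every UTF-8 byte is first mapped to a printable symbol, and that offsets are counted in characters of each word rather than in bytes. Under those assumptions, ₹ is a single character that occupies three bytes, so it can end up as three tokens that all point back to the same character span (0, 1). The helper names `bytes_to_unicode` and `byte_level_symbols` are illustrative and not part of the transformers API.

```python
def bytes_to_unicode():
    """Byte -> printable character table in the style of GPT-2's byte-level BPE."""
    # Bytes that are already printable keep their own code point.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    shift = 0
    for b in range(256):
        if b not in mapping:
            # Remaining bytes are shifted past U+00FF (the space byte becomes "Ġ").
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


def byte_level_symbols(text):
    """Return one display symbol per UTF-8 byte of `text`."""
    table = bytes_to_unicode()
    return [table[b] for b in text.encode("utf-8")]


rupee = "₹"
print(len(rupee))                       # 1 character
print(len(rupee.encode("utf-8")))       # 3 bytes
# The leading space mimics add_prefix_space=True.
print(byte_level_symbols(" " + rupee))  # ['Ġ', 'â', 'Ĥ', '¹']
```

The three symbols after Ġ match the tokens in the output above. Whether they stay separate or get merged depends on the merges the tokenizer learned. If the goal is to align word-level labels with tokens, the word index of each token from `output.word_ids()` (available on fast tokenizers) may be a more reliable key than the offsets, since the offsets restart at 0 for every word when `is_split_into_words=True`.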