I just found token_to_chars on the BatchEncoding returned by tokenized_text = tokenizer(text), and it seems to be exactly what I need. I’ll take a deeper look.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint; token_to_chars needs a fast tokenizer
tokenized_text = tokenizer(text)
num_of_tokens = len(tokenized_text["input_ids"])
for i in range(num_of_tokens):
    charspan = tokenized_text.token_to_chars(i)
    if charspan is not None:  # special tokens like [CLS]/[SEP] have no char span
        print(charspan.start, charspan.end)
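The spans index straight into the original string, so you can also recover each token’s surface text. Here is a minimal sketch (the checkpoint and the sample sentence are just placeholders):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
text = "Tokenizers map tokens back to character spans."
tokenized_text = tokenizer(text)

for i in range(len(tokenized_text["input_ids"])):
    charspan = tokenized_text.token_to_chars(i)
    if charspan is None:  # skip special tokens, which map to no characters
        continue
    print(i, repr(text[charspan.start:charspan.end]))

Note that this only works with fast (Rust-backed) tokenizers; you can check tokenizer.is_fast if you’re not sure which one you have.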