Getting spans from tokenizer

I just found token_to_chars on the encoding object returned by tokenizer(text), and it seems to be what I need. I’ll take a deeper look.


tokenized = tokenizer(text)
num_tokens = len(tokenized["input_ids"])
for i in range(num_tokens):
    # token_to_chars returns a CharSpan with start/end offsets into the original text
    charspan = tokenized.token_to_chars(i)
    print(charspan.start, charspan.end)
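If you are using a fast tokenizer, passing return_offsets_mapping=True gives you all the spans in one call, with special tokens reported as (0, 0). Below is a minimal sketch of how you would filter those out; the offsets list is made-up sample data standing in for real tokenizer output, not something a tokenizer actually produced:

```python
# Hypothetical offsets, shaped like the output of a fast tokenizer
# called with return_offsets_mapping=True (sample data, not real output).
text = "Hello world"
offsets = [(0, 0), (0, 5), (6, 11), (0, 0)]  # (0, 0) marks special tokens

# Keep only spans that point at actual characters in the text
spans = [(start, end) for start, end in offsets if (start, end) != (0, 0)]
print(spans)            # [(0, 5), (6, 11)]
print(text[0:5])        # Hello
```

Each kept span can be sliced directly out of the original string, which is handy for aligning token-level predictions back to character positions.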