Is there an equivalent of `BatchEncoding.char_to_token` for the outputs of `model.generate`?
For example, suppose calling `tokenizer.decode(model.generate(...)["sequences"])` produced the output:

"the cat sat on the mat"

where the output of `tokenizer.convert_ids_to_tokens(model.generate(...)["sequences"])` might look like:

`["_", "the", "_ca", "t", "_sat", "_on", "_", "the", "_mat"]`
How would I get the token indices (`[2, 3, 4]`) for the part of the output corresponding to "cat sat" (e.g. if I wanted to inspect the attention over that span)?
If there were an equivalent of the `BatchEncoding.char_to_token` function, I could get the character indices of "cat sat" from `"".join(tokenizer.convert_ids_to_tokens(model.generate(...)["sequences"]))` and then call `char_to_token` on them. But as far as I am aware, nothing like this exists for generated sequences. Is there an alternative method?
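To illustrate what I mean, here is a rough pure-Python sketch of the mapping I have in mind: join the token strings, find the character span of the substring, and collect the indices of tokens whose character ranges overlap it (the token list is the hypothetical SentencePiece-style output above, not real tokenizer output):

```python
def char_span_to_token_indices(tokens, start, end):
    """Return indices of tokens in "".join(tokens) that overlap chars [start, end)."""
    indices = []
    pos = 0
    for i, tok in enumerate(tokens):
        tok_start, tok_end = pos, pos + len(tok)
        # A token overlaps the span if it starts before the span ends
        # and ends after the span starts.
        if tok_start < end and tok_end > start:
            indices.append(i)
        pos = tok_end
    return indices

tokens = ["_", "the", "_ca", "t", "_sat", "_on", "_", "the", "_mat"]
joined = "".join(tokens)  # "_the_cat_sat_on_the_mat"
start = joined.find("cat")              # character index of "c"
end = joined.find("sat") + len("sat")   # one past the final "t"
print(char_span_to_token_indices(tokens, start, end))  # → [2, 3, 4]
```

This works on the joined token strings, but of course the joined string (with its "_" markers) is not the same as the decoded text, so offsets found in the decoded output would still need translating. That is why a built-in `char_to_token`-style helper for generated sequences would be nicer.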
Thanks!