Is there an equivalent of `BatchEncoding.char_to_token` for the outputs of `model.generate`?
For example, suppose calling `tokenizer.decode(model.generate(...)["sequences"])` produced the output:

"the cat sat on the mat"

where the output of `tokenizer.convert_ids_to_tokens(model.generate(...)["sequences"])` might look like:

`["_", "the", "_ca", "t", "_sat", "_on", "_", "the", "_mat"]`
How would I get the token indices (`[2, 3, 4]`) for the part of the output corresponding to "cat sat" (e.g. if I wanted to inspect the attention over that span)?
If there were an equivalent of the `BatchEncoding.char_to_token` function, I could get the character indices of "cat sat" from `"".join(tokenizer.convert_ids_to_tokens(model.generate(...)["sequences"]))` and then call `char_to_token` on them. But as far as I am aware, nothing like this exists for generated sequences. Is there an alternative method?
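To illustrate what I mean, here is a rough pure-Python sketch of the mapping I have in mind: join the token strings, find the character span of the substring, and collect the indices of tokens whose character ranges overlap it (the token list is the hypothetical SentencePiece-style output above, not real tokenizer output):

```python
def char_span_to_token_indices(tokens, start, end):
    """Return indices of tokens in "".join(tokens) that overlap chars [start, end)."""
    indices = []
    pos = 0
    for i, tok in enumerate(tokens):
        tok_start, tok_end = pos, pos + len(tok)
        # A token overlaps the span if it starts before the span ends
        # and ends after the span starts.
        if tok_start < end and tok_end > start:
            indices.append(i)
        pos = tok_end
    return indices

tokens = ["_", "the", "_ca", "t", "_sat", "_on", "_", "the", "_mat"]
joined = "".join(tokens)  # "_the_cat_sat_on_the_mat"
start = joined.find("cat")              # character index of "c"
end = joined.find("sat") + len("sat")   # one past the final "t"
print(char_span_to_token_indices(tokens, start, end))  # → [2, 3, 4]
```

This works on the joined token strings, but of course the joined string (with its "_" markers) is not the same as the decoded text, so offsets found in the decoded output would still need translating. That is why a built-in `char_to_token`-style helper for generated sequences would be nicer.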
Thanks!