Hey, I’m looking to stream tokens out of text2text-generation models (more specifically those from the T5 family), similar to what you see in the OpenAI Playground when requesting generations.
For text-generation models I’ve been able to emulate this (since we’re just predicting the current token from the ones before it) by setting a max_time parameter and appending each generation to the prompt, calling the model repeatedly until I’ve reached the number of tokens I want. But I can’t do the same for text2text models.
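For reference, a minimal sketch of that decoder-only workaround is below. It chunks with max_new_tokens rather than max_time, and the gpt2 checkpoint, chunk size, and total length are just illustrative choices, not my exact setup:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM works the same way here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Streaming generations feel faster because"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

total_new_tokens = 40
chunk_size = 5  # generate a few tokens per call, then re-feed the output as the new prompt

generated = 0
while generated < total_new_tokens:
    output_ids = model.generate(
        input_ids,
        max_new_tokens=min(chunk_size, total_new_tokens - generated),
        do_sample=False,
    )
    # Print only the freshly generated tokens, then append them to the prompt.
    new_tokens = output_ids[0, input_ids.shape[1]:]
    print(tokenizer.decode(new_tokens, skip_special_tokens=True), end="", flush=True)
    generated += new_tokens.shape[0]
    input_ids = output_ids
```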
Any advice on how I could stream output from models like T5? Is it even possible with the architecture?
Got a solution working. Inside generate(), each decoding method, for example greedy_search(), has a next_token variable, so you can pick up each token the model generates as soon as it’s ready. You’ll have to decode it yourself and handle the special-token and whitespace rules that decode() would normally apply for you, but it works well. I monkey patched it with a new greedy_search() that yields next_token on each step. Hope this helps anyone looking to do the same.
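For anyone who’d rather not monkey patch, here’s a minimal sketch of the same idea done by hand for a T5 model: it runs the encoder once, replicates the greedy decoder loop itself, and emits each token as soon as it’s chosen. The t5-small checkpoint, the 50-token cap, and the full-sequence re-decode trick are my own assumptions, not the patched greedy_search():

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
model.eval()

text = "translate English to German: The house is wonderful."
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Run the encoder once, then greedily decode one token at a time.
encoder_outputs = model.get_encoder()(input_ids)
decoder_input_ids = torch.full(
    (1, 1), model.config.decoder_start_token_id, dtype=torch.long
)

printed = ""
with torch.no_grad():
    for _ in range(50):  # illustrative max length
        outputs = model(
            encoder_outputs=encoder_outputs, decoder_input_ids=decoder_input_ids
        )
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy pick
        if next_token.item() == model.config.eos_token_id:
            break
        decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
        # Re-decode the whole sequence and emit only the new text; this sidesteps
        # the SentencePiece whitespace quirks you hit when decoding single tokens.
        so_far = tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True)
        print(so_far[len(printed):], end="", flush=True)
        printed = so_far
print()
```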