Streaming token output from models like T5

Hey, I’m looking to stream tokens out of text2text-generation models (more specifically those from the T5 family), similar to what you see in the OpenAI playground when requesting generations.

For text-generation models I’ve been able to emulate this (since we’re just predicting the current token from the ones before it) by setting a max_time parameter, appending each model generation to the prompt, and calling the model again until I’ve reached the number of tokens I wanted to generate. But I can’t do the same for text2text models.
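Here is a minimal sketch of that re-prompting loop. It assumes a hypothetical `generate_chunk` callable (not a transformers function) that takes the text so far and returns only the newly generated piece, for example a thin wrapper around a text-generation call limited by max_time:

```python
from typing import Callable, Iterator


def stream_by_reprompting(
    prompt: str,
    generate_chunk: Callable[[str], str],  # hypothetical: returns only the newly generated text
    max_rounds: int = 20,
) -> Iterator[str]:
    """Yield short continuations by feeding the growing text back in as the prompt."""
    text = prompt
    for _ in range(max_rounds):
        chunk = generate_chunk(text)
        if not chunk:  # the model produced nothing new, so stop
            break
        text += chunk
        yield chunk


if __name__ == "__main__":
    # Toy stand-in for a real model call: emits one canned piece per round.
    pieces = iter([" world", "!", ""])
    for piece in stream_by_reprompting("Hello", lambda text: next(pieces)):
        print(piece, end="", flush=True)
    print()
```

This only works because a text-generation model's output is a continuation of its prompt. With T5 the prompt goes to the encoder and the output is a separate decoder sequence, so appending the output to the prompt changes the input instead of continuing the generation.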

Any advice on how I could stream output from models like T5? Is it even possible with the architecture?

Got a solution working. In generate(), each decoding method (for example greedy_search()) has a next_token variable that holds the new token as soon as the model produces it, so you can read the generated tokens incrementally. You’ll have to decode them yourself and reimplement the special rules you’d normally get from decode(), but it works well. I monkey patched it with a new greedy_search() that yields next_token on each generation step. Hope this helps anyone looking to do the same.
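A rough sketch of the idea, written as a standalone greedy loop rather than the actual patch to greedy_search(). `next_token_fn` is a hypothetical callable, not part of transformers: it takes the decoder token ids so far and returns the id of the most likely next token. With T5 it could wrap a forward pass along the lines of `model(input_ids=..., decoder_input_ids=...).logits[0, -1].argmax()`, and the start and end ids are assumed to come from `model.config.decoder_start_token_id` and `model.config.eos_token_id`.

```python
from typing import Callable, Iterator, List


def stream_greedy(
    next_token_fn: Callable[[List[int]], int],  # hypothetical wrapper around one model forward pass
    start_token_id: int,
    eos_token_id: int,
    max_new_tokens: int = 64,
) -> Iterator[int]:
    """Yield token ids one at a time as a greedy decoder picks them."""
    decoder_ids = [start_token_id]
    for _ in range(max_new_tokens):
        next_token = next_token_fn(decoder_ids)
        if next_token == eos_token_id:  # stop without emitting the end-of-sequence token
            return
        decoder_ids.append(next_token)
        yield next_token


if __name__ == "__main__":
    # Toy "model": always predicts previous id + 1, with 4 acting as end-of-sequence.
    for token_id in stream_greedy(lambda ids: ids[-1] + 1, start_token_id=0, eos_token_id=4):
        print(token_id)  # a real caller would decode and send each token here
```

Decoding each id on its own can break spacing, since T5's SentencePiece tokens do not map one-to-one to words. One option is to decode the accumulated ids on every step and emit only the newly added suffix.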

Inspired by the solution from @zhuda , I made a streaming generation service for Hugging Face transformers that is fully compatible with the OpenAI API:

Hey guys, why did you choose to monkey patch instead of registering one of the callbacks in generate()?

This is awesome, can’t wait to take a proper look!


We recently released a bunch of open-source tools to do this with any HuggingFace model.

Check out our new library “text-generation-inference” (GitHub: huggingface/text-generation-inference, Large Language Model Text Generation Inference). It powers this Space: Chat Llm Streaming, a Hugging Face Space by olivierdehaene.

cc @olivierdehaene