Hey, I’m looking to stream tokens out of text2text-generation models (more specifically those from the T5 family), similar to what you see in the OpenAI playground when requesting generations.
For text-generation models I’ve been able to emulate this (since we’re just predicting the next token from the ones before it) by setting a max_time parameter and appending model generations to the prompt, calling repeatedly until I’ve reached the number of tokens I wanted to generate. But I can’t do the same for text2text models.
Any advice on how I could stream output from models like T5? Is it even possible with the architecture?
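For reference, the pseudo-streaming loop described above for decoder-only models can be sketched as follows. This is a minimal illustration of the loop structure only; `step` is a hypothetical stand-in for a single-token model call, not a real transformers API.

```python
# Sketch of "pseudo-streaming" for decoder-only models: ask the model
# for one new token at a time and append it to the growing prompt.
# `step` is a hypothetical stand-in for a single-token generate() call.
def stream_generate(prompt, step, max_new_tokens):
    text = prompt
    for _ in range(max_new_tokens):
        token = step(text)       # ask the model for one more token
        if token is None:        # model signalled end-of-sequence
            break
        text += token
        yield token              # hand the token to the caller immediately

# Toy step function for illustration only: emits a fixed continuation.
def make_toy_step():
    tokens = iter([" wor", "ld", "!", None])
    return lambda text: next(tokens)

pieces = list(stream_generate("hello", make_toy_step(), max_new_tokens=10))
print("".join(pieces))  # -> " world!"
```

The catch, as noted above, is that this trick relies on the decoder-only setup where the output is just a continuation of the prompt, which is why it doesn't transfer directly to encoder-decoder models like T5.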
Got a solution working. In `generate()`, for the different types of sampling, for example `greedy_search()`, there is a `next_token` variable from which you can incrementally get the subsequent tokens generated by the model as soon as they are ready. You’ll have to decode it yourself and apply the special rules you’d get from `decode()`, but it works well. I monkey patched it to emit the new `next_token` on each generation step. Hope this helps anyone looking to do the same.
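The monkey-patching idea can be sketched in isolation like this. Note that `ToyGreedySearch` is a toy stand-in for the real `greedy_search()` internals (not the transformers implementation); the point is only to show wrapping the spot where `next_token` is chosen so each token is pushed onto a queue the moment it exists.

```python
# Hedged sketch of streaming via monkey patching: wrap the method that
# picks next_token so every token is published to a queue immediately.
# ToyGreedySearch is a stand-in, not the real transformers code.
import queue

class ToyGreedySearch:
    def _pick_next_token(self, step):
        # stand-in for the model forward pass + argmax
        return ["Le", " chat", "<eos>"][step]

    def greedy_search(self, max_steps=3):
        out = []
        for step in range(max_steps):
            next_token = self._pick_next_token(step)
            if next_token == "<eos>":
                break
            out.append(next_token)
        return "".join(out)

def patch_for_streaming(model, token_queue):
    original = model._pick_next_token
    def hooked(step):
        next_token = original(step)
        token_queue.put(next_token)   # stream the token out immediately
        return next_token
    model._pick_next_token = hooked   # the monkey patch

q = queue.Queue()
model = ToyGreedySearch()
patch_for_streaming(model, q)
result = model.greedy_search()
print(result)                         # -> "Le chat"
# q received "Le", " chat", "<eos>" incrementally, in arrival order
```

A consumer (e.g. a web handler) can then drain the queue from another thread while generation is still running.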
Inspired by the solution from @zhuda, I made a streaming generation service for Hugging Face transformers that is fully compatible with the OpenAI API: https://github.com/hyperonym/basaran
Hey guys, why did you choose to monkey patch instead of registering one of the callbacks in `generate()`?
This is awesome, can’t wait to take a proper look!
We recently released a bunch of open-source tools to do this with any HuggingFace model.
Check out our new library text-generation-inference: https://github.com/huggingface/text-generation-inference. It powers the “Chat LLM Streaming” Space by olivierdehaene on Hugging Face.
We now also support a new `Streamer` class that works in tandem with `generate()`. Here’s a great Twitter thread by @joaogante going over it: https://twitter.com/joao_gante/status/1643330507093196800
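For anyone landing here later, a usage sketch with a T5 model might look like the following. This assumes the `TextIteratorStreamer` class from transformers (added around v4.28); imports are kept inside the function so merely defining it doesn’t download anything.

```python
# Sketch: streaming T5 output with transformers' TextIteratorStreamer.
# Assumes transformers >= 4.28, where generate() accepts a `streamer`.
def stream_t5(prompt, model_name="t5-small", max_new_tokens=40):
    from threading import Thread
    from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                              TextIteratorStreamer)

    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tok(prompt, return_tensors="pt")
    streamer = TextIteratorStreamer(tok, skip_special_tokens=True)

    # generate() blocks, so run it in a thread and consume the streamer
    thread = Thread(target=model.generate,
                    kwargs=dict(**inputs, streamer=streamer,
                                max_new_tokens=max_new_tokens))
    thread.start()
    for piece in streamer:            # yields decoded text incrementally
        print(piece, end="", flush=True)
    thread.join()

# Example call (downloads t5-small on first run):
# stream_t5("translate English to German: The house is wonderful.")
```

The iterator variant is handy for serving: the generation thread pushes decoded text while your request handler pulls pieces off the streamer as they arrive.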
This is perfect, will port OpenPlayground over to this solution soon!