I want to implement an LLM inference server that holds a collection of Hugging Face models and supports streaming inference, returning one token at a time. The problem is that a single token may not decode to readable text on its own: with byte-level BPE tokenizers, one token can be just a fragment of a multi-byte UTF-8 character. What should I do to achieve this goal: only return output once the tokens decode to readable text?
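To make the question concrete, here is a minimal sketch of the buffering approach I have in mind, in Python with `transformers` (the `gpt2` tokenizer and the `stream_readable` helper are placeholders of mine, not a real API): accumulate token ids and only flush once the decoded text no longer ends with the Unicode replacement character U+FFFD, which a byte-level BPE tokenizer produces for an incomplete UTF-8 sequence.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model

def stream_readable(token_ids):
    """Yield readable text chunks from a stream of token ids.

    Buffers ids until they decode to text that does not end with the
    replacement character U+FFFD, which signals an incomplete
    multi-byte UTF-8 sequence.
    """
    buffer = []  # ids that do not yet decode to complete characters
    for token_id in token_ids:
        buffer.append(token_id)
        text = tokenizer.decode(buffer, skip_special_tokens=True)
        # Byte-level BPE can split one UTF-8 character across several
        # tokens; while the sequence is incomplete, keep buffering.
        if text and not text.endswith("\ufffd"):
            yield text
            buffer = []
    if buffer:  # flush whatever remains at end of stream
        yield tokenizer.decode(buffer, skip_special_tokens=True)

# Example: multi-byte characters are held back until complete.
ids = tokenizer.encode("你好, world")
print(list(stream_readable(ids)))
```

One thing I am unsure about: decoding each buffered span independently may drop context between chunks for some tokenizers (e.g. leading spaces with SentencePiece), so perhaps I should decode the full sequence each step and emit only the new suffix instead. I also see that `transformers` ships `TextStreamer` / `TextIteratorStreamer` classes for streaming decoded text, which might already cover this case. Is rolling my own buffer like the above the right approach, or should I reuse those streamer classes?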
Please excuse my poor English…