Word-by-word TTS model for minimal latency

I’m building a Jarvis-style conversational AI assistant backed by a large language model (LLM). To make the experience as seamless and natural as possible, I want the assistant to start speaking as soon as the LLM begins generating its response, token by token.

To achieve this, I need a text-to-speech (TTS) model with very low latency that can generate audio word by word (or phoneme by phoneme) as the text stream comes in. Ideally, the output should still sound natural and conversational, without robotic artifacts.
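
To make the setup concrete, here’s a rough sketch of the kind of pipeline I have in mind. `llm_token_stream`, `synthesize`, and `play` are placeholders for whatever LLM, TTS, and audio libraries end up being used, not any particular API: tokens get buffered into whole words and handed to a TTS consumer as soon as they’re complete.

```python
import queue
import threading


def llm_token_stream():
    """Placeholder: yields tokens as the LLM generates them."""
    for token in ["Sure", ",", " the", " weather", " today", " is", " sunny", "."]:
        yield token


def synthesize(text: str) -> bytes:
    """Placeholder for a streaming-capable TTS call."""
    return text.encode()  # stand-in for synthesized audio samples


def play(audio: bytes) -> None:
    """Placeholder: push audio to the output device."""
    print(f"playing {len(audio)} bytes")


def speaker(chunks: queue.Queue) -> None:
    """Consume text chunks as they arrive and speak them immediately."""
    while (chunk := chunks.get()) is not None:
        play(synthesize(chunk))


def main() -> None:
    chunks: queue.Queue = queue.Queue()
    speaker_thread = threading.Thread(target=speaker, args=(chunks,))
    speaker_thread.start()

    # Buffer tokens into whole words before handing them to the TTS;
    # slightly larger clause-sized chunks would likely sound more natural
    # at the cost of a little extra latency.
    buffer = ""
    for token in llm_token_stream():
        # A token starting with whitespace means the buffered text is a
        # complete word, so hand it off and start buffering the next one.
        if token.startswith(" ") and buffer:
            chunks.put(buffer)
            buffer = ""
        buffer += token
    if buffer:
        chunks.put(buffer)

    chunks.put(None)  # end-of-stream sentinel
    speaker_thread.join()


if __name__ == "__main__":
    main()
```

The open question is what to plug into `synthesize` so that word-sized chunks still come out with natural prosody.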

Does anyone have recommendations for such a word-by-word TTS model?