Text To Speech In Real-Time

I have used various models for text to speech:
I had several complete sentences, sent them to the model endpoint, and got audio back.

Now I'm facing a new challenge: my sentences arrive dynamically, and I want to send them to the model word by word and get the audio back as the text comes in (a kind of real-time text to speech).
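
To make this concrete, here is a rough sketch of the flow I have in mind. It is only an illustration: `synthesize` is a placeholder for whatever model endpoint would be called (not a real API), and `chunk_words`, `stream_tts`, and the `max_words` limit are names I made up. Since a model probably can't produce natural speech from a single word, the sketch buffers incoming words into short phrases and flushes at punctuation.

```python
from typing import Callable, Iterable, Iterator

# Punctuation that marks a natural place to flush the buffer (an assumption).
PHRASE_ENDINGS = (".", "!", "?", ",", ";", ":")


def chunk_words(words: Iterable[str], max_words: int = 8) -> Iterator[str]:
    """Group a stream of words into short phrases.

    A phrase is emitted when a word ends with punctuation or when
    max_words words have accumulated, whichever comes first.
    """
    buffer = []
    for word in words:
        buffer.append(word)
        if word.endswith(PHRASE_ENDINGS) or len(buffer) >= max_words:
            yield " ".join(buffer)
            buffer = []
    if buffer:  # flush whatever is left when the stream ends
        yield " ".join(buffer)


def stream_tts(words: Iterable[str],
               synthesize: Callable[[str], bytes]) -> Iterator[bytes]:
    """Yield one audio chunk per phrase, as soon as the phrase is complete."""
    for phrase in chunk_words(words):
        yield synthesize(phrase)


def fake_synthesize(text: str) -> bytes:
    """Stand-in for a real TTS call; just returns the text as bytes."""
    return text.encode("utf-8")


if __name__ == "__main__":
    incoming = iter("Hello there, this is a test. Bye".split())
    for audio in stream_tts(incoming, fake_synthesize):
        print(audio)
```

What I don't know is whether any model or tool supports this kind of incremental input directly, or whether chunking on my side like this is the only option.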

I looked through the docs of several tools but couldn't find one that does this.
Has anyone handled this before and can help?

Just to emphasize: I'm talking about text to speech, not speech to text.