Realtime speech-to-text solution?

I am looking for a way to run server-side speech recognition with the lowest latency possible.

The ideal solution would process audio in real time, receiving audio samples as a stream (over WebSockets or any other transport). A low-code solution would be preferable.
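
To make the streaming requirement concrete, here is a minimal sketch of what I have in mind on the sending side. It assumes 16 kHz, 16-bit mono PCM and 100 ms chunks; `chunk_pcm` and its parameters are hypothetical names used only for illustration, and the transport itself (WebSocket or otherwise) is left out.

```python
from typing import Iterator


def chunk_pcm(
    pcm: bytes,
    sample_rate: int = 16000,  # assumed: 16 kHz audio
    sample_width: int = 2,     # assumed: 16-bit samples (2 bytes), mono
    chunk_ms: int = 100,       # assumed: 100 ms of audio per chunk
) -> Iterator[bytes]:
    """Yield fixed-duration slices of raw PCM audio, in order.

    The last chunk may be shorter than the others.
    """
    bytes_per_chunk = sample_rate * sample_width * chunk_ms // 1000
    for start in range(0, len(pcm), bytes_per_chunk):
        yield pcm[start:start + bytes_per_chunk]


# Example: one second of silence split into 100 ms chunks.
one_second = bytes(16000 * 2)
for chunk in chunk_pcm(one_second):
    pass  # each chunk would be sent to the server as soon as it is available
```

Ideally the server would return partial transcripts as these chunks arrive, rather than waiting for the complete recording.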

  1. Do Inference Endpoints have functionality that allows that?
  2. If not, do any other HF products have functionality that enables that?
  3. If not, can you recommend any third-party solutions?

Can you also recommend a speech-to-text model capable of processing input in the form of a stream? I am currently using Whisper (with Inference Endpoints) and, unfortunately, it can only process an audio file as a whole.