I am looking for a way to run server-side speech recognition with the lowest latency possible.
The ideal solution would process audio in real time by receiving audio samples as a stream (the stream could run over WebSockets or any other transport). A low-code solution would be preferable. A rough sketch of the client-side flow I have in mind is included after the questions below.
- Do Inference Endpoints have functionality that allows that?
- If not, do any other HF products have functionality that allows or enables that?
- If not, can you recommend any third-party solutions?
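To make the question concrete, here is a minimal sketch (in Python, using the `websockets` library) of the kind of client-side interaction I am imagining. The `wss://` URL, chunk size, and response format are hypothetical placeholders, not a real API:

```python
import asyncio
import websockets

WS_URL = "wss://example-asr-endpoint/stream"  # placeholder endpoint, not a real service

async def send_audio(ws, chunks):
    # "chunks" is an async iterator of small PCM byte buffers,
    # e.g. 20-100 ms of 16 kHz mono audio per message.
    async for chunk in chunks:
        await ws.send(chunk)

async def receive_transcripts(ws):
    # Print partial transcripts as the server pushes them back.
    async for message in ws:
        print("partial transcript:", message)

async def main(chunks):
    async with websockets.connect(WS_URL) as ws:
        await asyncio.gather(send_audio(ws, chunks), receive_transcripts(ws))
```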
Can you please also recommend a Speech-to-Text model capable of processing input in the form of a stream? I am currently using Whisper (with Inference Endpoints), and, unfortunately, it can only process an audio file as a whole.
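For context, this is roughly how I call the endpoint today: the entire file is uploaded in a single request and the transcript only comes back after the whole clip has been processed (the URL and token below are placeholders):

```python
import requests

# Placeholder endpoint URL and token for my deployed Whisper endpoint.
API_URL = "https://<my-endpoint>.endpoints.huggingface.cloud"
HEADERS = {
    "Authorization": "Bearer hf_xxx",
    "Content-Type": "audio/flac",
}

# The whole audio file is read and sent at once, not streamed.
with open("sample.flac", "rb") as f:
    audio_bytes = f.read()

response = requests.post(API_URL, headers=HEADERS, data=audio_bytes)
print(response.json())  # e.g. {"text": "..."} for the full clip
```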