Using the Inference API on a model that returns an audio file

Hello! How can I use the Inference API with a model that receives text and returns an audio file? I am trying to use this model: espnet/kan-bayashi_ljspeech_vits. What should the response from the Inference API look like? Thanks :)
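
For context, here is a minimal sketch of the kind of call I have in mind. It rests on assumptions I have not confirmed: that the hosted endpoint follows the `https://api-inference.huggingface.co/models/<model_id>` pattern, that the request body is JSON with an `inputs` field, and that a successful response body is the raw audio bytes with an `audio/...` content type. The helper names `build_request` and `synthesize` and the `HF_TOKEN` environment variable are just illustrative.

```python
import json
import os
import urllib.request

# Assumed URL pattern for the hosted Inference API; check the current docs.
API_URL_TEMPLATE = "https://api-inference.huggingface.co/models/{model_id}"


def build_request(model_id, text, token):
    """Build a POST request carrying the text to synthesize (hypothetical helper)."""
    body = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        API_URL_TEMPLATE.format(model_id=model_id),
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(model_id, text, token, out_path):
    """Send the request and write the response body to out_path.

    Assumes a successful response is raw audio bytes. urlopen raises
    urllib.error.HTTPError on non-2xx statuses (for example while the
    model is still loading), so those surface as exceptions.
    """
    request = build_request(model_id, text, token)
    with urllib.request.urlopen(request) as response:
        content_type = response.headers.get("Content-Type", "")
        payload = response.read()
    if not content_type.startswith("audio/"):
        # Probably a JSON error message rather than audio.
        raise RuntimeError(f"Unexpected content type: {content_type!r}")
    with open(out_path, "wb") as audio_file:
        audio_file.write(payload)
    return content_type


if __name__ == "__main__":
    # HF_TOKEN is an assumed environment variable holding an access token.
    kind = synthesize(
        "espnet/kan-bayashi_ljspeech_vits",
        "Hello from the Inference API.",
        os.environ["HF_TOKEN"],
        "output.audio",
    )
    print("Saved audio with content type:", kind)
```

If that assumption about the response is right, the returned content type should tell me which file extension to use (for example `audio/flac` or `audio/wav`), which is why the sketch returns it instead of hard-coding one.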