How to use the Inference API on a TTS model?

Hi, how can I use the Inference API on this model: espnet/kan-bayashi_ljspeech_vits?
It receives text and should return an audio file, since it is a text-to-speech model.

Hey there. Here is a snippet to do this. It will return the raw data, so you will still need to use something such as ffmpeg to process it. cc @Narsil

import json
import requests

# Assumes API_TOKEN holds your Hugging Face API token.
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/espnet/kan-bayashi_ljspeech_vits"

def query(payload):
    # POST the JSON payload and return the raw response bytes (the audio).
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return response.content

data = query({"inputs": "test"})

@osanseviero Hey! Thanks for the help. I managed to call the API and get a response :)
Regarding processing of the raw data, does it mean I should run it through ffmpeg and create an audio file from it, or did you mean something different? Thanks again!

@Narsil might know the most recommended way to load the audio at this point. I usually use ffmpeg, but only in the context of small tests and quick debugging. I’ll let Nicolas give a more proper answer.

Btw, we have an audio study group forming in our Discord server: Hugging Face. The first session will be in two weeks in case you’re interested.

@danijelpetkovic,

At this point you should have received a real audio file. If you just want to listen, you can simply save it:

with open("out.flac", "wb") as f:
    f.write(data)  # the bytes returned by query()

And in order to listen to it, use your favorite media player.

vlc out.flac works for me. Windows Media Player, Winamp, foobar2000, Chrome and Firefox can probably read it too.
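
If you happen to be working in a Jupyter or Colab notebook, you can also play the file inline. A minimal sketch, assuming IPython is available:

from IPython.display import Audio

# Renders an inline audio player for the saved file in a notebook.
Audio("out.flac")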

ffmpeg is necessary only if you want multiple files and want to compress them a little. The API responds with FLAC files, which use lossless compression, but you can get much better compression with a lossy algorithm (all to save disk space; it might not be necessary for you).
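
For example, here is a minimal sketch that shells out to ffmpeg (assuming it is installed and on your PATH) to transcode the FLAC output into a smaller lossy MP3; the 64 kbit/s bitrate is just an illustrative choice:

import subprocess

# Transcode the lossless FLAC returned by the API into a 64 kbit/s MP3
# to save disk space. Requires ffmpeg on the PATH.
subprocess.run(
    ["ffmpeg", "-y", "-i", "out.flac", "-b:a", "64k", "out.mp3"],
    check=True,
)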

@Narsil Thank you very much! I managed to save and play the file exactly as you described above. The only thing I still can’t figure out is why the audio is distorted: it is not the same voice as when I was calling the model directly. I tried to add parameters to the Inference API:

audio_file = query_audio_tts({
    "inputs": generated_answer,
    "parameters": {
        "vocoder_tag": "str_or_none(none)",
        "threshold": 0.5,
        "minlenratio": 0.0,
        "maxlenratio": 10.0,
        "use_att_constraint": False,
        "backward_window": 1,
        "forward_window": 3,
        "speed_control_alpha": 1.0,
        "noise_scale": 0.333,
        "noise_scale_dur": 0.333
    }
})

But this didn’t help. Do you have any ideas here? Thanks!

@danijelpetkovic Do you have a reference for how you create your files without distortion?

Currently espnet and other libraries (except transformers) don’t support adding parameters, unfortunately. Every library has a different set of parameters, and maintaining each and every one of them would very quickly become tedious and would probably lead to clashes and bugs (not even mentioning the docs for all of that).

That being said, if we can improve the defaults for some models we should definitely do so. The implementation for this API is defined here: huggingface_hub/automatic_speech_recognition.py at main · huggingface/huggingface_hub · GitHub

As you can see it’s pretty bare bones, so any help to improve it is welcome.

@Narsil Yes, sure. Here is the code chunk which I was using before:

from espnet2.bin.tts_inference import Text2Speech
from espnet2.utils.types import str_or_none
import torch
import scipy.io.wavfile

tag = "kan-bayashi/ljspeech_vits"
vocoder_tag = "none"

text2speech = Text2Speech.from_pretrained(
    model_tag=str_or_none(tag),
    vocoder_tag=str_or_none(vocoder_tag),
    device="cpu",
    # Only for Tacotron 2 & Transformer
    threshold=0.5,
    # Only for Tacotron 2
    minlenratio=0.0,
    maxlenratio=10.0,
    use_att_constraint=False,
    backward_window=1,
    forward_window=3,
    # Only for FastSpeech & FastSpeech2 & VITS
    speed_control_alpha=1.0,
    # Only for VITS
    noise_scale=0.333,
    noise_scale_dur=0.333,
)

def get_audio_tts(text):
    with torch.no_grad():
        wav = text2speech(text)["wav"]
        scipy.io.wavfile.write("out.wav", text2speech.fs, wav.view(-1).cpu().numpy())
    return "out.wav"

audio_file = get_audio_tts(generated_answer)

Looking at the comments, those options do seem to depend on the model.

Do you think you could maybe create a PR here: huggingface_hub/automatic_speech_recognition.py at main · huggingface/huggingface_hub · GitHub?

We could ping some espnet maintainers to take a look.

@Narsil I created a new issue and referenced the file you sent me.

Turns out the issue was with the sampling rate, in fact. Here is the fix: Fixing FS for `espnet`. by Narsil · Pull Request #542 · huggingface/huggingface_hub · GitHub
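
For intuition: if the raw samples are written (or played back) at a different sampling rate than the model’s native one, the voice comes out pitch-shifted and sped up or slowed down, which sounds exactly like a distorted voice. A minimal sketch reusing the text2speech object from the snippet above; halving the rate is just an illustrative wrong value:

import scipy.io.wavfile
import torch

with torch.no_grad():
    wav = text2speech("test")["wav"].view(-1).cpu().numpy()

# Correct: write the samples at the model's native sampling rate.
scipy.io.wavfile.write("ok.wav", text2speech.fs, wav)

# Wrong: the same samples written at half the rate play back slower
# and pitched down, i.e. a "distorted" voice.
scipy.io.wavfile.write("distorted.wav", text2speech.fs // 2, wav)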

@Narsil Thanks for the fast response and the fix. I will test it in the app and let you know!

@Narsil Hey. Sorry if I am missing something; maybe I misunderstood. But I tried the API response after the fix above, and it still keeps returning a distorted voice.

@Narsil Hey 🙂 I found an interesting thing and created a small app to show the problem. Depending on the content sent to the TTS model, the voice is returned differently.

You can just try pasting these two slightly different paragraphs and you will see the difference in the voice.

“Water heated to room temperature feels colder than the air around it. This is because the temperature difference between the water and the air is greater than that of the air surrounding it.”

“Water heated to room temperature feels colder than the air around it. This is because the temperature difference between water and air is greater than the difference between the temperature of the water and the air.”

This looks like a cache issue (there’s a cache in front of the API to avoid computing the same thing over and over).

You can try adding {"inputs": "....", "parameters": {"use_cache": False}} to your input to force the output to be recalculated.
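
For reference, a minimal sketch of that call, reusing the query helper from earlier in the thread (the input text is just the first test sentence from above):

# use_cache=False forces the API to recompute the audio instead of
# returning a previously cached result for the same input.
data = query({
    "inputs": "Water heated to room temperature feels colder than the air around it.",
    "parameters": {"use_cache": False},
})

with open("out.flac", "wb") as f:
    f.write(data)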

The caching mechanism should be upgraded at some point so you don’t have to do this.