How to use the Inference API on a TTS model?

Hi, how can I use the Inference API on this model: espnet/kan-bayashi_ljspeech_vits?
It receives text and should return an audio file, since it is a text-to-speech model.

Hey there. Here is a snippet to do this. It will return the raw data, so you will still need to use something such as ffmpeg to process it. cc @Narsil

import json
import requests

# Assumes API_TOKEN holds your Hugging Face API token.
headers = {"Authorization": f"Bearer {API_TOKEN}"}
API_URL = "https://api-inference.huggingface.co/models/espnet/kan-bayashi_ljspeech_vits"

def query(payload):
    # POST the JSON payload and return the raw response bytes (the audio).
    data = json.dumps(payload)
    response = requests.request("POST", API_URL, headers=headers, data=data)
    return response.content

data = query({"inputs": "test"})

@osanseviero Hey! Thanks for the help. I managed to call the API and get a response :)
Regarding processing of the raw data, does it mean I should run it through ffmpeg and create an audio file from it, or did you mean something different? Thanks again!

@Narsil might know the most recommended way to load the audio at this point. I usually use ffmpeg, but only in the context of small tests and quick debugging. I’ll let Nicolas give a more proper answer.

Btw, we have an audio study group forming in our Discord server: Hugging Face. The first session will be in two weeks in case you’re interested.

@danijelpetkovic,

At this point you should have received a real audio file. If you just want to listen, you can simply save it:

with open("out.flac", "wb") as f:
    f.write(data)  # the bytes returned by query()

And in order to listen to it, use your favorite media player.

vlc out.flac works for me. Windows Media Player, Winamp, foobar2000, Chrome and Firefox can probably read it too.
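
If you happen to be working in a Jupyter or Colab notebook, you can also play the file inline. A minimal sketch, assuming IPython is available:

from IPython.display import Audio

# Renders an inline audio player for the saved file in a notebook.
Audio("out.flac")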

ffmpeg is necessary only if you want multiple files and want to compress them a little. The API responds with FLAC files, which use lossless compression, but you can get much better compression with a lossy algorithm (all to save disk space; it might not be necessary for you).
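
For example, here is a minimal sketch that shells out to ffmpeg (assuming it is installed and on your PATH) to transcode the FLAC output into a smaller lossy MP3; the 64 kbit/s bitrate is just an illustrative choice:

import subprocess

# Transcode the lossless FLAC returned by the API into a 64 kbit/s MP3
# to save disk space. Requires ffmpeg on the PATH.
subprocess.run(
    ["ffmpeg", "-y", "-i", "out.flac", "-b:a", "64k", "out.mp3"],
    check=True,
)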

@Narsil Thank you very much! I managed to save and play the file exactly as you described above. The only thing I still can’t figure out is why the audio is distorted: it is not the same voice as when I was calling the model directly. I tried to add parameters to the Inference API:

audio_file = query_audio_tts({
    "inputs": generated_answer,
    "parameters": {
        "vocoder_tag": "str_or_none(none)",
        "threshold": 0.5,
        "minlenratio": 0.0,
        "maxlenratio": 10.0,
        "use_att_constraint": False,
        "backward_window": 1,
        "forward_window": 3,
        "speed_control_alpha": 1.0,
        "noise_scale": 0.333,
        "noise_scale_dur": 0.333
    }
})

But this didn’t help. Do you have any ideas here? Thanks!

@danijelpetkovic Do you have a reference for how you create your files without distortion?

Currently espnet and other libraries (except transformers) don’t support adding parameters, unfortunately. Every library has a different set of parameters, and maintaining each and every one of them would very quickly become tedious and would probably lead to clashes and bugs (not even mentioning the docs for all of that).

That being said, if we can improve the defaults for some models we should definitely do so. The implementation for this API is defined here: huggingface_hub/automatic_speech_recognition.py at main · huggingface/huggingface_hub · GitHub

As you can see it’s pretty bare bones, so any help to improve it is welcome.

@Narsil Yes, sure. Here is the code chunk which I was using before:

from espnet2.bin.tts_inference import Text2Speech
from espnet2.utils.types import str_or_none
import torch
import scipy.io.wavfile

tag = "kan-bayashi/ljspeech_vits"
vocoder_tag = "none"

text2speech = Text2Speech.from_pretrained(
    model_tag=str_or_none(tag),
    vocoder_tag=str_or_none(vocoder_tag),
    device="cpu",
    # Only for Tacotron 2 & Transformer
    threshold=0.5,
    # Only for Tacotron 2
    minlenratio=0.0,
    maxlenratio=10.0,
    use_att_constraint=False,
    backward_window=1,
    forward_window=3,
    # Only for FastSpeech & FastSpeech2 & VITS
    speed_control_alpha=1.0,
    # Only for VITS
    noise_scale=0.333,
    noise_scale_dur=0.333,
)

def get_audio_tts(text):
    with torch.no_grad():
        wav = text2speech(text)["wav"]
        scipy.io.wavfile.write("out.wav", text2speech.fs, wav.view(-1).cpu().numpy())
    return "out.wav"

audio_file = get_audio_tts(generated_answer)

Looking at the comments, those options do seem to depend on the model.

Do you think you could maybe create a PR here: huggingface_hub/automatic_speech_recognition.py at main · huggingface/huggingface_hub · GitHub?

We could ping some espnet maintainers to take a look.

@Narsil I created a new issue and referenced the file you sent me.

Turns out the issue was with the sampling rate, in fact. Here is the fix: Fixing FS for `espnet`. by Narsil · Pull Request #542 · huggingface/huggingface_hub · GitHub
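
For intuition: if the raw samples are written (or played back) at a different sampling rate than the model’s native one, the voice comes out pitch-shifted and sped up or slowed down, which sounds exactly like a distorted voice. A minimal sketch reusing the text2speech object from the snippet above; halving the rate is just an illustrative wrong value:

import scipy.io.wavfile
import torch

with torch.no_grad():
    wav = text2speech("test")["wav"].view(-1).cpu().numpy()

# Correct: write the samples at the model's native sampling rate.
scipy.io.wavfile.write("ok.wav", text2speech.fs, wav)

# Wrong: the same samples written at half the rate play back slower
# and pitched down, i.e. a "distorted" voice.
scipy.io.wavfile.write("distorted.wav", text2speech.fs // 2, wav)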

@Narsil Thanks for the fast response and the fix. I will test it in the app and let you know!

@Narsil Hey. Sorry if I am missing something; maybe I misunderstood. But I tried the API response after the fix above, and it still keeps returning a distorted voice.

@Narsil Hey 🙂 I found an interesting thing and created a small app to show the problem. Depending on the content sent to the TTS model, the voice is returned differently.

You can just try pasting these two slightly different paragraphs and you will see the difference in the voice.

“Water heated to room temperature feels colder than the air around it. This is because the temperature difference between the water and the air is greater than that of the air surrounding it.”

“Water heated to room temperature feels colder than the air around it. This is because the temperature difference between water and air is greater than the difference between the temperature of the water and the air.”

This looks like a cache issue (there’s a cache in front of the API to avoid computing the same thing over and over).

You can try adding {"inputs": "....", "parameters": {"use_cache": False}} to your input to force the output to be recalculated.
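
For reference, a minimal sketch of that call, reusing the query helper from earlier in the thread (the input text is just the first test sentence from above):

# use_cache=False forces the API to recompute the audio instead of
# returning a previously cached result for the same input.
data = query({
    "inputs": "Water heated to room temperature feels colder than the air around it.",
    "parameters": {"use_cache": False},
})

with open("out.flac", "wb") as f:
    f.write(data)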

The caching mechanism should be upgraded at some point so you don’t have to do this.