Two-way speech-to-speech translation model EN-DE

Hi

I am working on a project whose goal is a model that can translate speech to speech in real time, both EN-DE and DE-EN.

I found Facebook's one-way translation model facebook/textless_sm_cs_en on Hugging Face.

The problem is that I tried the other direction with facebook/s2t-wav2vec2-large-en-de (also on Hugging Face), but it seems to crash.

I was thinking of using one model to convert EN speech to text, then translating the EN text to DE text, and finally using a text-to-speech model to output the DE speech.
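To make the idea concrete, here is a minimal sketch of such a cascade, with one instance per direction. Everything in it is hypothetical: `CascadeTranslator` and the `fake_*` stages are placeholder names, and the stubs stand in for real speech recognition, translation, and text-to-speech models so that the sketch runs without downloading anything.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CascadeTranslator:
    """One direction of a cascade: speech -> text -> translated text -> speech."""
    transcribe: Callable[[bytes], str]   # speech recognition: source audio -> source text
    translate: Callable[[str], str]      # machine translation: source text -> target text
    synthesize: Callable[[str], bytes]   # text to speech: target text -> target audio

    def __call__(self, audio: bytes) -> bytes:
        source_text = self.transcribe(audio)
        target_text = self.translate(source_text)
        return self.synthesize(target_text)


# Placeholder stages (assumptions): "audio" is just UTF-8 text in bytes,
# and "translation" is a tiny lookup table.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_mt_en_de(text: str) -> str:
    return {"good morning": "guten Morgen"}.get(text, text)


def fake_mt_de_en(text: str) -> str:
    return {"guten Morgen": "good morning"}.get(text, text)


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


# Two-way translation is two cascades, selected by direction.
translators: Dict[str, CascadeTranslator] = {
    "en-de": CascadeTranslator(fake_asr, fake_mt_en_de, fake_tts),
    "de-en": CascadeTranslator(fake_asr, fake_mt_de_en, fake_tts),
}

german_audio = translators["en-de"](b"good morning")
english_audio = translators["de-en"](german_audio)
```

Keeping each stage behind a plain callable means any one model can be swapped out later, for example after finding a replacement for a model that crashes, without touching the rest of the chain.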

I am not sure how to continue from here.
Can you give me some tips?

Thank you and regards

I have worked on this task before, but translating from English to Arabic. I built a cascaded pipeline consisting of:

  1. Speech recognition for En, using a Wav2Vec model.
  2. Punctuation restoration, because Wav2Vec drops punctuation; we use the deepmultilingualpunctuation model.
  3. Machine translation from En to Ar, using an mBART model.
  4. Tashkeel (diacritics) restoration: you can learn more here
  5. Text-to-speech, using a FastSpeech2 model.
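The five steps above can be sketched as a simple chain of stages. This is only an illustration under stated assumptions: `run_pipeline` and the stage names are hypothetical, and the lambdas are stand-ins for the real models, not calls into any library. The sketch also records how long each stage takes, which helps when looking for the slowest part of a cascade.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

# A stage is a (name, function) pair; each function consumes the previous output.
Stage = Tuple[str, Callable[[Any], Any]]


def run_pipeline(stages: List[Stage], data: Any) -> Tuple[Any, Dict[str, float]]:
    """Run the stages in order and record the seconds spent in each one."""
    timings: Dict[str, float] = {}
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = time.perf_counter() - start
    return data, timings


# Stand-in stages (assumptions): real ones would wrap the actual models.
stages: List[Stage] = [
    ("asr", lambda audio: audio.decode("utf-8")),            # audio -> unpunctuated text
    ("punctuation", lambda text: text.capitalize() + "."),   # restore punctuation
    ("translation", lambda text: f"<ar>{text}</ar>"),        # En text -> Ar text
    ("tashkeel", lambda text: text),                         # add diacritics (no-op here)
    ("tts", lambda text: text.encode("utf-8")),              # Ar text -> audio
]

output, timings = run_pipeline(stages, b"hello world")
slowest = max(timings, key=timings.get)  # name of the stage that took longest
print(f"slowest stage: {slowest}")
```

With real models plugged in, the `timings` dictionary shows which stage dominates the latency, so it is clear which one is worth replacing or merging first.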

This pipeline is very slow and computationally expensive, and the results are not good at all. I will soon publish this work on GitHub for further discussion. For now, we are trying to simplify the pipeline, either by using SpeechT5 for end-to-end speech translation or by replacing the first three models with a single model that translates En audio to Ar text. We are also working on the first open-source automatic video dubbing, from En to Ar and vice versa at first, with support for other languages to be added later.
