Two-way speech-to-speech translation model (EN-DE)


I am working on a project whose goal is a model that translates speech to speech in real time, in both directions (EN-DE and DE-EN).

I have found Facebook's one-way translation model facebook/textless_sm_cs_en on Hugging Face.

The problem is that I tried the other direction with facebook/s2t-wav2vec2-large-en-de from Hugging Face, but it seems to crash.

I was thinking of chaining three models instead: one to convert EN speech to EN text, one to translate the EN text to DE text, and a text-to-speech model to output the DE speech.
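
To make that three-stage idea concrete, here is a minimal sketch of how the chain could be wired up for both directions. Everything in it is an assumption for illustration: the names `CascadeS2ST`, `translate_speech`, `fake_asr`, `fake_tts` and `make_fake_translator` are hypothetical, audio is represented as plain bytes, and the stages are toy placeholders so the code runs without downloading any model. In a real setup each placeholder would be swapped for an actual speech recognition, translation, or text-to-speech model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Stand-in type (assumption): real audio would be samples plus a sampling rate.
Audio = bytes


@dataclass
class CascadeS2ST:
    """One translation direction built from three pluggable stages."""
    transcribe: Callable[[Audio], str]   # speech -> text (source language)
    translate: Callable[[str], str]      # source text -> target text
    synthesize: Callable[[str], Audio]   # target text -> speech

    def __call__(self, audio: Audio) -> Audio:
        source_text = self.transcribe(audio)
        target_text = self.translate(source_text)
        return self.synthesize(target_text)


# --- Toy placeholder stages (hypothetical, not real models) ---

def fake_asr(audio: Audio) -> str:
    # Pretends the audio bytes already contain the spoken words as UTF-8.
    return audio.decode("utf-8")


def fake_tts(text: str) -> Audio:
    # Pretends that encoding the text produces audio.
    return text.encode("utf-8")


EN_DE = {"hello": "hallo", "world": "welt"}
DE_EN = {de: en for en, de in EN_DE.items()}


def make_fake_translator(table: Dict[str, str]) -> Callable[[str], str]:
    # Word-by-word dictionary lookup; unknown words pass through unchanged.
    def translate(text: str) -> str:
        return " ".join(table.get(word, word) for word in text.lower().split())
    return translate


# One cascade per direction, selected by (source, target) language codes.
PIPELINES: Dict[Tuple[str, str], CascadeS2ST] = {
    ("en", "de"): CascadeS2ST(fake_asr, make_fake_translator(EN_DE), fake_tts),
    ("de", "en"): CascadeS2ST(fake_asr, make_fake_translator(DE_EN), fake_tts),
}


def translate_speech(audio: Audio, src: str, tgt: str) -> Audio:
    return PIPELINES[(src, tgt)](audio)


if __name__ == "__main__":
    print(translate_speech(b"hello world", "en", "de"))  # b'hallo welt'
    print(translate_speech(b"hallo welt", "de", "en"))   # b'hello world'
```

Keeping the three stages as separate callables means each one can be tested and replaced on its own, and the second direction only needs a second set of stages. The trade-off is that latency and errors add up across the stages, which matters for the real-time requirement.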

I am not sure how to continue from here.
Can you give me some tips?

Thank you and regards

