Multimodal architectures with HuggingFace transformers for speech and text

Hi, could somebody point me to some useful tutorials on building multimodal architectures for speech and text using PyTorch and HuggingFace Transformers?

Thank you in advance.

Hi,

We do have the SpeechEncoderDecoderModel class, which consists of an audio Transformer encoder and a language Transformer decoder: Speech Encoder Decoder Models.
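For example, you can warm-start such a model from a pretrained speech encoder and a pretrained text decoder. A minimal sketch (the checkpoint names are just illustrative, swap in whichever encoder/decoder pair you need):

```python
import torch
from transformers import AutoTokenizer, SpeechEncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Combine a pretrained audio Transformer encoder with a pretrained text decoder.
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/wav2vec2-base-960h",  # speech encoder
    "bert-base-uncased",            # language decoder
)

# The combined config needs to know how decoding starts and how padding is done.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Dummy batch: 1 second of 16 kHz audio and a short tokenized target.
input_values = torch.randn(1, 16000)
labels = tokenizer("a test transcription", return_tensors="pt").input_ids

outputs = model(input_values=input_values, labels=labels)
print(outputs.loss)
```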

Update: we now support many more models involving speech and text, such as Wav2Vec2, Whisper, etc. We also have an audio course that covers all of this: Welcome to the Hugging Face Audio Course! - Hugging Face Audio Course
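For instance, a small Whisper checkpoint can transcribe speech in a few lines (the tiny checkpoint and the random waveform below are only placeholders; any `whisper-*` checkpoint and a real 16 kHz recording work the same way):

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Replace this random tensor with a real 16 kHz waveform (e.g. loaded via datasets or librosa).
waveform = torch.randn(16000)

inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
predicted_ids = model.generate(inputs.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```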


Thank you. If there are other Transformer models or resources that would be useful, I am open to other ideas.

Unfortunately, that is not exactly what I was looking for. I want to use attention mechanisms to extract features from both the audio and the text input, then merge the two and feed the result into a recurrent neural network. Something along the lines of the rough sketch below is what I have in mind.
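A possible sketch (not an official HF architecture; the checkpoints, the simple concatenation along the time axis, and the GRU size are all assumptions to adapt): encode audio with Wav2Vec2 and text with BERT, concatenate the two hidden-state sequences, and run a GRU over the merged sequence.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, BertModel, Wav2Vec2Model


class AudioTextFusion(nn.Module):
    """Toy fusion model: attention-based encoders per modality, then an RNN over the merged features."""

    def __init__(self, hidden_size=256):
        super().__init__()
        self.audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        # Both base checkpoints output 768-dim features, so they can share one GRU.
        self.rnn = nn.GRU(input_size=768, hidden_size=hidden_size, batch_first=True)

    def forward(self, input_values, input_ids, attention_mask=None):
        audio_feats = self.audio_encoder(input_values).last_hidden_state        # (B, T_audio, 768)
        text_feats = self.text_encoder(
            input_ids, attention_mask=attention_mask
        ).last_hidden_state                                                      # (B, T_text, 768)
        fused = torch.cat([audio_feats, text_feats], dim=1)                      # merge along the time axis
        _, final_state = self.rnn(fused)
        return final_state.squeeze(0)                                            # (B, hidden_size)


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AudioTextFusion()

text = tokenizer("a matching transcript or caption", return_tensors="pt")
audio = torch.randn(1, 16000)  # 1 s of 16 kHz audio; use real waveforms in practice

features = model(audio, text.input_ids, text.attention_mask)
print(features.shape)  # torch.Size([1, 256])
```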