Multimodal architectures with HuggingFace transformers for speech and text

Hi, could somebody point me to some useful tutorials on building multimodal architectures for speech and text with PyTorch and HuggingFace transformers?

Thank you in advance.


We do have the SpeechEncoderDecoderModel class, which consists of an audio Transformer encoder and a language Transformer decoder: Speech Encoder Decoder Models.
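A minimal sketch of how that class can be instantiated, assuming the `transformers` library is installed. The tiny, randomly initialised configs below are only there to keep the example self-contained and fast; in real use you would load pretrained checkpoints instead, e.g. `SpeechEncoderDecoderModel.from_encoder_decoder_pretrained("facebook/wav2vec2-base-960h", "bert-base-uncased")`:

```python
import torch
from transformers import (
    BertConfig,
    SpeechEncoderDecoderConfig,
    SpeechEncoderDecoderModel,
    Wav2Vec2Config,
)

# Tiny configs (hypothetical sizes, chosen only so the example runs offline).
encoder_cfg = Wav2Vec2Config(
    hidden_size=32, num_hidden_layers=2, num_attention_heads=2,
    intermediate_size=64, num_feat_extract_layers=2,
    conv_dim=(32, 32), conv_kernel=(3, 3), conv_stride=(2, 2),
)
decoder_cfg = BertConfig(
    hidden_size=32, num_hidden_layers=2, num_attention_heads=2,
    intermediate_size=64,
)

# Ties the audio encoder and text decoder together (adds cross-attention
# to the decoder automatically).
config = SpeechEncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_cfg, decoder_cfg
)
model = SpeechEncoderDecoderModel(config)
model.eval()

# Raw waveform in, token logits out.
input_values = torch.randn(1, 1600)          # (batch, samples)
decoder_input_ids = torch.tensor([[101, 0, 102]])  # (batch, target length)
with torch.no_grad():
    out = model(input_values=input_values, decoder_input_ids=decoder_input_ids)
print(out.logits.shape)  # (1, 3, decoder vocab size)
```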


Thank you. If there are other transformers or resources that would be useful, I am open to other ideas.

Unfortunately, it is not exactly what I was looking for. I want to extract features from both the audio and the text input with attention mechanisms, then merge the two and feed the result into a recurrent neural network.
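That architecture can be sketched in plain PyTorch. The module below is a hypothetical illustration, not an existing HuggingFace class: each modality gets its own self-attention encoder, a cross-attention layer lets the text representation attend to the audio representation, and the concatenated features are fed into a GRU. All names and dimensions (`SpeechTextFusion`, `audio_dim=40`, etc.) are my own assumptions:

```python
import torch
import torch.nn as nn


class SpeechTextFusion(nn.Module):
    """Attention-based feature extraction per modality, fusion, then an RNN."""

    def __init__(self, audio_dim=40, vocab_size=1000, d_model=64, rnn_hidden=128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)   # e.g. filterbank frames
        self.text_emb = nn.Embedding(vocab_size, d_model)
        self.audio_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.text_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Cross-modal attention: text positions query the audio sequence.
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        # Merged (text + attended audio) features go into the recurrent network.
        self.rnn = nn.GRU(2 * d_model, rnn_hidden, batch_first=True)

    def forward(self, audio_feats, text_ids):
        a = self.audio_enc(self.audio_proj(audio_feats))   # (B, T_audio, d_model)
        t = self.text_enc(self.text_emb(text_ids))         # (B, T_text, d_model)
        fused, _ = self.cross_attn(t, a, a)                # text attends to audio
        merged = torch.cat([t, fused], dim=-1)             # (B, T_text, 2*d_model)
        return self.rnn(merged)                            # outputs, final hidden


model = SpeechTextFusion()
audio = torch.randn(2, 50, 40)                 # (batch, audio frames, features)
text = torch.randint(0, 1000, (2, 12))         # (batch, token ids)
out, h = model(audio, text)
print(out.shape, h.shape)
```

One common alternative to concatenation is summing or gating the two streams; which fusion works best tends to be task-dependent.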