Multimodal architectures with HuggingFace transformers for speech and text

Hi, could somebody point me to some useful tutorials on building multimodal architectures for speech and text using PyTorch and HuggingFace Transformers?

Thank you in advance.

Hi,

We do have the SpeechEncoderDecoderModel class, which consists of an audio Transformer encoder and a language Transformer decoder: Speech Encoder Decoder Models.
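For example, you can warm-start such a model from a pretrained speech encoder and a pretrained text decoder. A minimal sketch (the checkpoint names are just illustrative, swap in whichever encoder/decoder pair you need):

```python
import torch
from transformers import AutoTokenizer, SpeechEncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Combine a pretrained audio Transformer encoder with a pretrained text decoder.
model = SpeechEncoderDecoderModel.from_encoder_decoder_pretrained(
    "facebook/wav2vec2-base-960h",  # speech encoder
    "bert-base-uncased",            # language decoder
)

# The combined config needs to know how decoding starts and how padding is done.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Dummy batch: 1 second of 16 kHz audio and a short tokenized target.
input_values = torch.randn(1, 16000)
labels = tokenizer("a test transcription", return_tensors="pt").input_ids

outputs = model(input_values=input_values, labels=labels)
print(outputs.loss)
```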

Update: we now support many more models involving speech and text, such as Wav2Vec2, Whisper, etc. We also have an audio course that covers all of this: Welcome to the Hugging Face Audio Course! - Hugging Face Audio Course
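For instance, a small Whisper checkpoint can transcribe speech in a few lines (the tiny checkpoint and the random waveform below are only placeholders; any `whisper-*` checkpoint and a real 16 kHz recording work the same way):

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# Replace this random tensor with a real 16 kHz waveform (e.g. loaded via datasets or librosa).
waveform = torch.randn(16000)

inputs = processor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
predicted_ids = model.generate(inputs.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))
```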


Thank you. If there are other Transformer models or resources that would be useful, I am open to other ideas.

Unfortunately, that is not exactly what I was looking for. I want to use attention mechanisms to extract features from both the audio and the text input, then merge the two and feed the result into a recurrent neural network. Something along the lines of the rough sketch below is what I have in mind.
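A possible sketch (not an official HF architecture; the checkpoints, the simple concatenation along the time axis, and the GRU size are all assumptions to adapt): encode audio with Wav2Vec2 and text with BERT, concatenate the two hidden-state sequences, and run a GRU over the merged sequence.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, BertModel, Wav2Vec2Model


class AudioTextFusion(nn.Module):
    """Toy fusion model: attention-based encoders per modality, then an RNN over the merged features."""

    def __init__(self, hidden_size=256):
        super().__init__()
        self.audio_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        # Both base checkpoints output 768-dim features, so they can share one GRU.
        self.rnn = nn.GRU(input_size=768, hidden_size=hidden_size, batch_first=True)

    def forward(self, input_values, input_ids, attention_mask=None):
        audio_feats = self.audio_encoder(input_values).last_hidden_state        # (B, T_audio, 768)
        text_feats = self.text_encoder(
            input_ids, attention_mask=attention_mask
        ).last_hidden_state                                                      # (B, T_text, 768)
        fused = torch.cat([audio_feats, text_feats], dim=1)                      # merge along the time axis
        _, final_state = self.rnn(fused)
        return final_state.squeeze(0)                                            # (B, hidden_size)


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AudioTextFusion()

text = tokenizer("a matching transcript or caption", return_tensors="pt")
audio = torch.randn(1, 16000)  # 1 s of 16 kHz audio; use real waveforms in practice

features = model(audio, text.input_ids, text.attention_mask)
print(features.shape)  # torch.Size([1, 256])
```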