Models for Automatic Speech Recognition

diallomama · March 3, 2023, 10:33am

Hello,
There are a few things I don't understand from reading the literature. As I understand it, an ASR system can be built by stacking LSTM units for both the encoder and the decoder. When we extract audio features such as the number of channels, the intensity, and so on, are those the features we pass to the LSTM units? Are the encoder and decoder LSTM units sufficient on their own to go from speech to the transcription? And is it the same when using a CNN, or is the CNN considered the acoustic part, on top of which a language model (LM) has to be added?
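To make the setup in the question concrete, here is a minimal sketch of a stacked-LSTM encoder and decoder, assuming PyTorch. The class name `Seq2SeqASR`, the feature size (80 values per frame), the hidden size, and the vocabulary size are all hypothetical choices for illustration, and the inputs are random tensors rather than real audio features. It has no attention, no CTC, and no external LM, so it only shows how the tensors flow; it does not claim this is enough for good transcriptions.

```python
import torch
import torch.nn as nn


class Seq2SeqASR(nn.Module):
    """Toy stacked-LSTM encoder-decoder (hypothetical, for illustration only)."""

    def __init__(self, n_feats=80, hidden=128, vocab=30, layers=2):
        super().__init__()
        # Encoder: reads one feature vector per audio frame.
        self.encoder = nn.LSTM(n_feats, hidden, num_layers=layers, batch_first=True)
        # Decoder: reads the previous output tokens and predicts the next ones.
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.LSTM(hidden, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, feats, prev_tokens):
        # feats: (batch, n_frames, n_feats); prev_tokens: (batch, n_tokens)
        _, state = self.encoder(feats)            # final (h_n, c_n) of the encoder
        dec_in = self.embed(prev_tokens)          # (batch, n_tokens, hidden)
        dec_out, _ = self.decoder(dec_in, state)  # decoder starts from encoder state
        return self.out(dec_out)                  # (batch, n_tokens, vocab)


torch.manual_seed(0)
model = Seq2SeqASR()
feats = torch.randn(4, 200, 80)               # 4 utterances, 200 frames, 80 features each
prev_tokens = torch.randint(0, 30, (4, 12))   # 12 previous tokens per utterance
logits = model(feats, prev_tokens)
print(logits.shape)
```

In this sketch the only link between audio and text is the encoder's final hidden and cell state, which initializes the decoder. That is the bare "encoder and decoder only" design the question asks about.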
