Models for Automatic Speech Recognition

There are some things I don't understand from reading the literature. To build an ASR system, we can stack LSTM units for both the encoder and the decoder. When we extract audio features (e.g., the number of channels, the intensity, …), are those the features we pass to the LSTM units? And are those encoder and decoder LSTMs alone sufficient to go from speech to a transcription? Is it the same when using a CNN, or is the CNN considered only the acoustic part, on top of which we have to add a language model (LM)?
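To make the first part of my question concrete, here is a rough sketch of the setup I have in mind: extracted per-frame features go into stacked recurrent encoder layers, one hidden vector per audio frame. This is only an illustration of the shape flow, with hypothetical sizes (`T`, `n_mels`, `d_h`) and a plain tanh RNN cell standing in for a real LSTM:

```python
import numpy as np

def rnn_layer(x, w_x, w_h):
    # Simple tanh RNN used as a stand-in for an LSTM layer.
    # x: (T, d_in) sequence of feature frames; returns (T, d_h) hidden states.
    T = x.shape[0]
    d_h = w_h.shape[0]
    h = np.zeros(d_h)
    out = np.empty((T, d_h))
    for t in range(T):
        h = np.tanh(x[t] @ w_x + h @ w_h)
        out[t] = h
    return out

rng = np.random.default_rng(0)
T, n_mels, d_h = 50, 40, 32                     # hypothetical sizes
features = rng.standard_normal((T, n_mels))     # e.g. one feature vector per frame

# "Stacked" encoder: two recurrent layers, one feeding the next.
h1 = rnn_layer(features,
               rng.standard_normal((n_mels, d_h)) * 0.1,
               rng.standard_normal((d_h, d_h)) * 0.1)
h2 = rnn_layer(h1,
               rng.standard_normal((d_h, d_h)) * 0.1,
               rng.standard_normal((d_h, d_h)) * 0.1)
print(h2.shape)  # (50, 32): one hidden vector per audio frame
```

Is this roughly the picture, with a decoder (another LSTM stack) then producing characters or word pieces from these encoder states?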