Wav2Vec2 fine-tuning and language model

Hi there,
I’m fine-tuning wav2vec2-base-960h for disordered speech recognition. The model was trained on my custom dataset of isolated words (no sentences) uttered by speakers with atypical voices. On isolated-word recognition tasks its performance is very good; however, it fails to recognize a sequence of two or more keywords within a single speech recording. How can I recognize multiple keywords in one recording? Should I use a language model? Any suggestions?
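For context, here is a minimal sketch of the greedy CTC decoding step I understand the model to be doing after the acoustic forward pass (the frame-wise token IDs and the blank ID are made up for illustration). In principle this decoding should already emit a sequence of tokens when several words appear in one recording, which is why I’m unsure whether the problem is the decoder or the acoustic model itself:

```python
def ctc_greedy_decode(frame_ids, blank_id=0):
    """Collapse repeated frame predictions, then drop CTC blanks.

    frame_ids: per-frame argmax token IDs from the CTC head.
    Returns the collapsed output token sequence.
    """
    out = []
    prev = None
    for t in frame_ids:
        # A token is emitted only when it differs from the previous
        # frame's prediction and is not the blank symbol.
        if t != prev and t != blank_id:
            out.append(t)
        prev = t  # track the raw previous frame, blanks included
    return out


# Two occurrences of token 3 separated by a blank stay distinct,
# so repeated keywords are not merged into one.
print(ctc_greedy_decode([0, 3, 3, 0, 3, 5, 5, 0]))  # → [3, 3, 5]
```

So with plain greedy decoding the model can in principle output several keywords per recording; my question is whether adding a language model (e.g. shallow fusion during beam-search decoding) is the right way to make that reliable.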
Thanks in advance,