ASR help with recognizing sequences of words

Hi there,

by following this tutorial:

I have fine-tuned the model “facebook/wav2vec2-base” on my custom dataset, which contains single words uttered by people with atypical speech. I have observed high word recognition accuracy (greater than 95%). Now I would like to use the same dataset to recognize short sequences of words. For example, if I have trained the model on the keywords “volume” and “up”, I would like to recognize the sequence “volume up” within a single speech recording. Is this possible? Any ideas on how to achieve this with Transformers?
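To make the goal concrete, here is a rough sketch of the kind of decoding I have in mind. It assumes the fine-tuned model has a CTC head (e.g. Wav2Vec2ForCTC) that emits one character prediction per audio frame, with "|" as the word delimiter and the pad token acting as the CTC blank. The toy vocabulary, the frame IDs, and the function name `ctc_greedy_decode` are made up for illustration and are not taken from my actual model.

```python
from itertools import groupby

# Hypothetical toy vocabulary; "|" marks word boundaries.
VOCAB = ["<pad>", "|", "v", "o", "l", "u", "m", "e", "p"]
BLANK_ID = 0  # assumption: the pad token doubles as the CTC blank


def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse repeated frame predictions, drop blanks, split words on "|"."""
    # Merge consecutive duplicates: [2, 2, 0, 3] -> [2, 0, 3]
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Remove blank frames and map the remaining IDs to characters
    chars = [vocab[i] for i in collapsed if i != blank_id]
    # Turn word delimiters into spaces
    return "".join(chars).replace("|", " ").strip()


# Made-up per-frame argmax IDs for a recording of "volume up"
frames = [2, 2, 0, 3, 4, 4, 0, 5, 6, 6, 7, 0, 1, 1, 0, 5, 5, 8, 0]
print(ctc_greedy_decode(frames))
```

If the model were instead a classifier over whole keywords, I assume something like this would not apply directly and the recording would first have to be split into word segments.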

Thanks in advance,