Fine-tuning Whisper for multilingual tasks

Hello community.

I’m using whisper-large-v2 for audio transcription.
My task is to transcribe audio in Kazakh and Russian.
The problem is that in Kazakhstan, people often use Russian words when speaking Kazakh. I tried fine-tuning a model on a dataset that was 50% Kazakh and 50% Russian, but the result was disappointing: the model would, for example, identify the audio as Kazakh and then fail to transcribe the Russian words in it, and vice versa.
Would it be acceptable to create a combined “ru-kk” tokenizer that merges the two languages into one? And would it be possible to fine-tune the base model with such a tokenizer?
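
For illustration, a 50/50 mix like the one described above could be assembled roughly as follows. This is only a sketch, not the actual training code: `build_balanced_mix`, the sample dictionaries, and the file names are hypothetical. It shows that each sample ends up with exactly one language tag ("kk" or "ru"), which may be why utterances that mix both languages are handled poorly.

```python
import random


def build_balanced_mix(kk_samples, ru_samples, seed=0):
    """Return a shuffled list with equal numbers of Kazakh and Russian samples.

    Hypothetical helper: each sample is a dict such as
    {"audio": "clip.wav", "text": "..."}.
    """
    rng = random.Random(seed)
    # Downsample the larger set so both languages contribute n samples.
    n = min(len(kk_samples), len(ru_samples))
    kk = rng.sample(kk_samples, n)
    ru = rng.sample(ru_samples, n)
    # Attach a single language tag per sample ("kk" or "ru").
    mixed = [{**s, "language": "kk"} for s in kk]
    mixed += [{**s, "language": "ru"} for s in ru]
    rng.shuffle(mixed)
    return mixed


# Toy data standing in for real audio/transcript pairs (assumed names).
kk_data = [{"audio": f"kk_{i}.wav", "text": f"kk text {i}"} for i in range(5)]
ru_data = [{"audio": f"ru_{i}.wav", "text": f"ru text {i}"} for i in range(3)]

mix = build_balanced_mix(kk_data, ru_data)
counts = {lang: sum(s["language"] == lang for s in mix) for lang in ("kk", "ru")}
print(counts)
```

A Kazakh utterance containing Russian words would still get only the "kk" tag in this scheme, so the model never sees an example labelled as mixed.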