Loading a custom audio dataset and fine-tuning a model

Hi all. I’m very new to HuggingFace and I have a question that I hope someone can help with.

The XLSR-53 (Wav2Vec2) model was suggested to me for my use case, which is speech-to-text. However, the languages I need aren’t supported, so I was told I have to fine-tune the model for my requirements. I’ve seen several tutorials, but they all use Common Voice, which also doesn’t cover the languages I need.

I have ~4 hours of audio files and TSV files (annotations of the audio), but I am not sure how to load them and fine-tune the model with them. I can’t find much info online either. Is there any reference I can follow?

Any help would be appreciated.

@patrickvonplaten I am also trying this out for a similar use case, but so far I couldn’t find any example script for audio datasets other than Common Voice. I have several datasets that aren’t available on Hugging Face Datasets, and because almost all the scripts rely so heavily on Hugging Face Datasets, it’s hard to work out how to adapt them to my use case. If you could suggest any resources or changes that would let me use my own dataset instead of Common Voice or another dataset available on Hugging Face Datasets, it would be a great help.
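
For what it’s worth, here is a minimal sketch of one way local audio files plus a TSV could be turned into the kind of dataset the Common Voice scripts work with. It is not an official recipe. The column names `path` and `sentence` (borrowed from the Common Voice layout), the file and folder names, and both helper functions are assumptions made up for illustration, so adjust them to match your own TSV. The 16 kHz resampling is there because the Wav2Vec2 checkpoints were pretrained on 16 kHz audio.

```python
import csv
from pathlib import Path


def load_tsv_manifest(tsv_path, audio_dir, path_column="path", text_column="sentence"):
    """Read a TSV of annotations into two parallel columns.

    Hypothetical helper: assumes the TSV has a header row with one column
    holding the audio file name and one holding the transcript.
    """
    audio_dir = Path(audio_dir)
    columns = {"audio": [], "sentence": []}
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quotation marks inside transcripts untouched.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            columns["audio"].append(str(audio_dir / row[path_column]))
            columns["sentence"].append(row[text_column].strip())
    return columns


def build_dataset(tsv_path, audio_dir, test_size=0.1):
    """Wrap the manifest in a Hugging Face dataset (requires `datasets`).

    Hypothetical helper: the audio is decoded lazily and resampled to
    16 kHz when an example is accessed.
    """
    from datasets import Audio, Dataset

    columns = load_tsv_manifest(tsv_path, audio_dir)
    dataset = Dataset.from_dict(columns)
    dataset = dataset.cast_column("audio", Audio(sampling_rate=16_000))
    # Hold out part of the data for evaluation.
    return dataset.train_test_split(test_size=test_size)
```

Something like `build_dataset("annotations.tsv", "clips")` (both names are placeholders) would then give a train/test split that can stand in for the `load_dataset("common_voice", ...)` call in the fine-tuning examples, although the rest of the preprocessing may still need small changes to match your column names.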