How to fine-tune the wav2vec2.0 XLSR model with long audio files

Hi everyone,

I followed a tutorial (Fine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers) to learn how to fine-tune a wav2vec2.0 model.
I managed to fine-tune this model on the Common Voice dataset.
Then I tried to do the same with my own custom dataset, but ran into a ‘CUDA out of memory’ error.
My audio files are each around 10 minutes long, so I guess they are too long to fit in GPU memory.
Are there any methods for handling such long audio files when fine-tuning the wav2vec2.0 model?
Any suggestions are appreciated.

Unless you can train on several GPUs in parallel, you will probably have to cut your audio files down to segments of somewhere between 15 and 30 seconds. That worked for me. (Also, make sure you’re using fp16.)
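
In case it helps, here is a minimal sketch of the splitting step. It only computes where to cut; `chunk_bounds` is a hypothetical helper name, and the 16 kHz sample rate and 20-second limit are assumptions (16 kHz is what the tutorial resamples to, and 20 seconds is simply a value inside the 15-30 second range above). Note that fixed-length cuts ignore word boundaries, so you would still need to split each transcript to match its segment, or cut on silences instead.

```python
def chunk_bounds(num_samples, sample_rate=16_000, max_seconds=20.0):
    """Return (start, end) sample indices that cut a waveform of
    num_samples samples into consecutive pieces of at most max_seconds.

    Hypothetical helper for illustration; not part of any library.
    """
    chunk_len = int(max_seconds * sample_rate)
    if num_samples < 0 or chunk_len < 1:
        raise ValueError("need num_samples >= 0 and at least one sample per chunk")

    bounds = []
    start = 0
    while start < num_samples:
        # The last piece may be shorter than chunk_len.
        end = min(start + chunk_len, num_samples)
        bounds.append((start, end))
        start = end
    return bounds


# A 10-minute file at 16 kHz, cut into 20-second pieces.
ten_minutes = 10 * 60 * 16_000
segments = chunk_bounds(ten_minutes)
print(len(segments), segments[0], segments[-1])
```

Each `(start, end)` pair can then be used to slice the loaded waveform, e.g. `waveform[start:end]`, before the segment is passed to the processor. As for fp16, if you are using the `Trainer` setup from the tutorial, it is switched on with `fp16=True` in `TrainingArguments`.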