Decoding Large Audio Files Using the Wav2Vec2ForCTC Model

I’ve been working with the Wav2Vec2ForCTC model for a while. Until now my audio files were short (around 1 min each). When I tested the model on a long file (around 14 min), it did not fit in GPU memory, so I switched to the CPU. I noticed that it used more than 200 GB of RAM to decode! I tried splitting the audio file into smaller segments and using hidden states to link them together while decoding each segment, but I could not find a way to feed the hidden states from one segment into the next so that the model can use them while decoding.

Any ideas or suggestions?


Try using VAD (voice activity detection) to split the large audio file into smaller clips at the pauses (ends of sentences). ASR performs well on clips like these without any special way of feeding state from one clip to the next.
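
As a rough illustration of splitting at pauses, here is a minimal sketch that uses a plain frame-energy threshold as a stand-in for a real VAD. The function name `split_on_silence` and all parameter defaults (16 kHz audio, 30 ms frames, the energy threshold, the 300 ms minimum pause) are assumptions made up for this example and would need tuning on real recordings; a trained VAD will be far more robust to noise.

```python
from typing import List, Sequence, Tuple


def split_on_silence(
    samples: Sequence[float],
    sample_rate: int = 16000,        # assumed sampling rate
    frame_ms: int = 30,              # assumed analysis frame length
    energy_threshold: float = 1e-4,  # assumed; tune for your recordings
    min_silence_ms: int = 300,       # assumed minimum pause length
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of clips separated by pauses.

    Hypothetical helper: a simple energy threshold, not a real VAD.
    """
    frame_len = sample_rate * frame_ms // 1000
    min_silent_frames = max(1, min_silence_ms // frame_ms)

    clips: List[Tuple[int, int]] = []
    clip_start = None  # sample index where the current clip began
    silent_run = 0     # consecutive quiet frames seen inside the clip

    for frame_start in range(0, len(samples), frame_len):
        frame = samples[frame_start:frame_start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)  # mean power

        if energy >= energy_threshold:
            # Loud frame: open a clip if none is open, reset the pause counter.
            if clip_start is None:
                clip_start = frame_start
            silent_run = 0
        elif clip_start is not None:
            # Quiet frame inside a clip: count it toward a pause.
            silent_run += 1
            if silent_run >= min_silent_frames:
                # The pause is long enough: close the clip where it began.
                clip_end = frame_start - (silent_run - 1) * frame_len
                clips.append((clip_start, clip_end))
                clip_start, silent_run = None, 0

    if clip_start is not None:
        # The audio ended while a clip was still open.
        clips.append((clip_start, len(samples)))
    return clips
```

Each `(start, end)` pair can then be used to slice the waveform, and each slice decoded on its own, so memory use depends on the longest clip rather than on the full 14 minutes.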

Thank you @permutans.
The problem is that I’m going to lose the context of the whole audio file if I decode each small segment on its own. That is why I put this option last on my list.