Decoding Large Audio Files Using Wav2Vec2ForCTC Model

farisalasmary · October 26, 2021, 3:22pm

I’ve been working with the Wav2Vec2ForCTC model for a while. Until now my audio files were short (~ 1 min). When I tested the model on a long file (~ 14 mins), it did not fit on the GPU, so I switched to the CPU. I noticed that it used more than 200 GB of RAM to decode! I tried splitting the audio file into smaller files and using the hidden states to link them together while decoding each segment, but I could not find a way to feed the hidden states from the current segment into the model when decoding the next one.

Any ideas or suggestions?

@patrickvonplaten

permutans · October 27, 2021, 1:46pm

Try using VAD to split the large audio file into smaller clips, using the pauses (ends of sentences). ASR performs well without any special way of feeding state from one clip to the next.
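
A rough sketch of the idea, in case it helps. This is not a real VAD, just a simple energy threshold over fixed-size frames using NumPy; the function name `split_on_silence` and all parameter defaults (30 ms frames, 300 ms minimum pause, the energy threshold) are assumptions made for illustration and would need tuning on real recordings.

```python
import numpy as np


def split_on_silence(samples, sample_rate=16000, frame_ms=30,
                     energy_threshold=1e-4, min_silence_ms=300):
    """Return (start, end) sample indices of segments separated by pauses.

    Hypothetical helper: a frame counts as "voiced" when its mean squared
    amplitude exceeds `energy_threshold`. A segment is closed once at least
    `min_silence_ms` of consecutive unvoiced frames has been seen.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    if n_frames == 0:
        return []

    # Cut the signal into non-overlapping frames and compute energy per frame.
    frames = np.asarray(samples[:n_frames * frame_len], dtype=np.float64)
    frames = frames.reshape(n_frames, frame_len)
    voiced = (frames ** 2).mean(axis=1) > energy_threshold

    min_silence_frames = max(1, min_silence_ms // frame_ms)
    segments = []
    start = None        # first frame of the segment currently being built
    silence_run = 0     # consecutive unvoiced frames inside that segment

    for i, is_voiced in enumerate(voiced):
        if is_voiced:
            if start is None:
                start = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                # Close the segment right after its last voiced frame.
                end = i - silence_run + 1
                segments.append((start * frame_len, end * frame_len))
                start = None
                silence_run = 0

    if start is not None:
        # The audio ended while a segment was still open.
        end = n_frames - silence_run
        segments.append((start * frame_len, end * frame_len))
    return segments
```

Each `samples[start:end]` slice can then be decoded on its own, so memory use depends on the longest clip rather than on the whole recording. A dedicated VAD would likely find sentence boundaries more reliably than this threshold.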

farisalasmary · October 28, 2021, 11:30am

Thank you @permutans.
The problem is that I’m going to lose the context of the whole audio file if I decode each small segment on its own. That is why I kept this option as the last one on my list.
