I am currently using the facebook/wav2vec2-base
model for an audio classification task. My code is based on the official HF Audio Classification tutorial.
In this tutorial, audio is sampled at a sample_rate
of 16,000 Hz. This means that 1 second of audio results in an array of length 16,000. Also, as far as I know, Wav2Vec2's
maximum input length is 150,000 samples.
Does this mean that, without any chunking, the model can only process audio clips up to ~10 seconds (150,000 / 16,000 ≈ 9.4 s)?
If you have longer audio, which I guess is usually the case, what strategies can be applied to mitigate this issue?
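For context on the kind of chunking I mean: a common workaround is to split the waveform into fixed-length (optionally overlapping) windows, run each window through the model, and aggregate the per-window logits (e.g. by averaging or majority vote). Below is a minimal sketch of the splitting step; the function name, window length, and overlap are my own choices, not anything prescribed by the tutorial:

```python
import numpy as np

def chunk_waveform(waveform, sr=16_000, chunk_s=10.0, overlap_s=1.0):
    """Split a 1-D waveform into fixed-length, optionally overlapping chunks.

    The final chunk may be shorter; the feature extractor can pad it later.
    """
    chunk_len = int(chunk_s * sr)
    step = int((chunk_s - overlap_s) * sr)
    chunks = []
    for start in range(0, len(waveform), step):
        chunks.append(waveform[start:start + chunk_len])
        if start + chunk_len >= len(waveform):
            break
    return chunks

# 25 s of dummy audio at 16 kHz -> three ~10 s windows with 1 s overlap
audio = np.zeros(25 * 16_000, dtype=np.float32)
chunks = chunk_waveform(audio)

# At inference time one would then (sketch, not run here) feed each chunk
# through the feature extractor and model, and average the logits:
#   all_logits = [model(**fe(c, sampling_rate=16_000,
#                            return_tensors="pt")).logits for c in chunks]
#   prediction = torch.stack(all_logits).mean(dim=0).argmax(-1)
```

Averaging logits treats every window as equally informative, which is a simplification; for tasks where the label depends on a short event, max-pooling the logits or a majority vote over per-chunk predictions may work better.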