I’m working on fine-tuning a speech model (either wav2vec 2.0 or HuBERT) to classify speech at the frame level (i.e., every ~20ms of audio must be classified). Specifically, I’m looking to use the
Wav2Vec2ForAudioFrameClassification class, but I’m uncertain about the shape and format of the labels it expects.
For this task, I’d like to pass in a torch tensor whose length corresponds to the audio duration divided by the frame hop (e.g., 20ms), containing binary labels (0 or 1) for each frame. Is this approach feasible? I’m having difficulty understanding the precise requirements for the labels.
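For reference, this is how I’m estimating the number of frames (and therefore the label length) per clip. This is just a sketch assuming the standard wav2vec 2.0 / HuBERT feature-encoder conv stack at 16 kHz (kernel sizes 10,3,3,3,3,2,2 and strides 5,2,2,2,2,2,2); if the model’s config differs, the numbers would change:

```python
# Sketch: estimate how many output frames the wav2vec 2.0 / HuBERT feature
# encoder emits for a given number of input samples, assuming the standard
# conv stack (kernels (10,3,3,3,3,2,2), strides (5,2,2,2,2,2,2)) at 16 kHz.
# The frame-level label sequence would then have this length per example.

CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)

def num_output_frames(num_samples: int) -> int:
    """Number of ~20 ms frames the encoder emits for `num_samples` samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Standard conv output-length formula (no padding):
        length = (length - kernel) // stride + 1
    return length

# One second of 16 kHz audio -> 49 frames (~20 ms hop, minus edge effects),
# so a binary label sequence for that clip would have 49 entries.
print(num_output_frames(16000))
```

If I’m reading the library right, the Transformers model objects also expose a helper for this, so the hand-rolled arithmetic above would only be for sanity-checking that my label tensors line up with the logits.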
Does anyone have a working example of
Wav2Vec2ForAudioFrameClassification, or can someone point me to the correct shape and formatting for the labels? Any help would be appreciated, thanks a lot!