I am not sure whether this is the right channel to ask.
I am new to wav2vec models and understand that wav2vec usually acts as a “frontend” model, so we need to extract embeddings or features from it. I used the script below to produce embeddings for future use. For a single wav file, the output has shape [1, 212, 1024] for the hidden states and [1, 212, 512] for the features.
If I want a single fixed-size embedding per file (a 1024- or 512-dimensional vector), would simply averaging over the 212 frames be a valid solution?
Source of the code: the Stack Overflow question "Getting embeddings from wav2vec2 models in HuggingFace".
import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Load the audio and resample it to 16 kHz
input_audio, sample_rate = librosa.load("/content/test.wav", sr=16000)

model_name = "facebook/wav2vec2-large-960h"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)

# Turn the raw waveform into model inputs
i = feature_extractor(input_audio, return_tensors="pt", sampling_rate=sample_rate)

# Run the model without tracking gradients
with torch.no_grad():
    o = model(i.input_values)

print(o.keys())
print(o.last_hidden_state.shape)   # [1, 212, 1024] for my file
print(o.extract_features.shape)    # [1, 212, 512] for my file
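For reference, here is a minimal sketch of what I mean by simple averaging (mean pooling over the time axis). It uses random stand-in tensors with the shapes reported above rather than real model outputs, and the `mean_pool` helper and its optional `mask` argument are hypothetical names of mine, not part of transformers. The mask only matters for batches of padded clips, and it would have to be a frame-level mask, which I am assuming is available. Whether the averaged vector is good enough probably depends on the downstream task.

```python
import torch


def mean_pool(hidden, mask=None):
    """Average frame-level vectors over the time axis.

    hidden: tensor of shape [batch, frames, dim]
    mask:   optional tensor of shape [batch, frames], 1 for real frames and
            0 for padded frames (hypothetical input, assumed frame-level)
    Returns a tensor of shape [batch, dim].
    """
    if mask is None:
        # Single unpadded clip: a plain mean over the frames dimension
        return hidden.mean(dim=1)
    mask = mask.unsqueeze(-1).to(hidden.dtype)   # [batch, frames, 1]
    summed = (hidden * mask).sum(dim=1)          # sum over real frames only
    counts = mask.sum(dim=1).clamp(min=1.0)      # avoid division by zero
    return summed / counts


# Stand-in tensors with the same shapes as the outputs above (not real data)
hidden_states = torch.randn(1, 212, 1024)
features = torch.randn(1, 212, 512)

utt_hidden = mean_pool(hidden_states)
utt_feat = mean_pool(features)
print(utt_hidden.shape)  # torch.Size([1, 1024])
print(utt_feat.shape)    # torch.Size([1, 512])
```

With the real outputs this would be `mean_pool(o.last_hidden_state)` or `mean_pool(o.extract_features)`, and `.squeeze(0)` would drop the batch dimension to give a plain 1024- or 512-dimensional vector.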