I am not sure if this is the right channel to ask.
I am new to wav2vec models. I understand that wav2vec usually acts as a "frontend" model, so we need to extract embeddings or features from it. I used the script below to produce embeddings for later use. The output from a single wav file is [1, 212, 1024] for the hidden states and [1, 212, 512] for the features.
If I want a single one-dimensional embedding (in either 1024 or 512 dimensions), would simply averaging over the time axis be a valid approach?
Source of the code: python - Getting embeddings from wav2vec2 models in HuggingFace - Stack Overflow
```python
import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Load the audio at the 16 kHz sampling rate the model expects
input_audio, sample_rate = librosa.load("/content/test.wav", sr=16000)

model_name = "facebook/wav2vec2-large-960h"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)

i = feature_extractor(input_audio, return_tensors="pt", sampling_rate=sample_rate)
with torch.no_grad():
    o = model(i.input_values)

print(o.keys())
print(o.last_hidden_state.shape)  # [1, 212, 1024]
print(o.extract_features.shape)   # [1, 212, 512]
```
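For reference, mean pooling over the time axis is one common heuristic for getting a single fixed-size vector per clip (other options include max pooling or pooling a specific layer). A minimal sketch, using a random tensor as a stand-in for `o.last_hidden_state` so it runs on its own:

```python
import torch

# Stand-in for o.last_hidden_state from the script above:
# shape [batch, time_steps, hidden_dim] = [1, 212, 1024]
hidden_states = torch.randn(1, 212, 1024)

# Average over the time axis (dim=1), then drop the batch axis,
# leaving one 1024-dimensional embedding for the whole clip
embedding = hidden_states.mean(dim=1).squeeze(0)

print(embedding.shape)  # torch.Size([1024])
```

The same pattern applied to `o.extract_features` would give a 512-dimensional vector instead. Note that plain averaging treats every frame equally, including any padded frames if you batch multiple files, so in that case you would want to mask out padding before pooling.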