Single embedding from single wav file for wav2vec models?

I am not sure if this is the right channel to ask.

I am new to wav2vec models and aware that wav2vec usually acts as a "frontend" model, so we typically extract embeddings or features from it. I used the script below to produce embeddings for future use. For a single wav file, the output is [1, 212, 1024] for the hidden states and [1, 212, 512] for the features.

If I want a single one-dimensional embedding (of either 1024 or 512 dimensions), would simple averaging over the time axis be a valid solution? (See the pooling sketch after the script for what I mean.)

Source of the code: "Getting embeddings from wav2vec2 models in HuggingFace" on Stack Overflow

import librosa
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Load the audio at 16 kHz, the sampling rate wav2vec 2.0 expects
input_audio, sample_rate = librosa.load("/content/test.wav", sr=16000)

model_name = "facebook/wav2vec2-large-960h"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)

# Prepare the raw waveform as a batched PyTorch tensor
i = feature_extractor(input_audio, return_tensors="pt", sampling_rate=sample_rate)
with torch.no_grad():
    o = model(i.input_values)

print(o.keys())
print(o.last_hidden_state.shape)  # [1, 212, 1024], transformer outputs
print(o.extract_features.shape)   # [1, 212, 512], CNN feature encoder outputs
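
To be concrete, this is the kind of averaging I have in mind: a minimal sketch that mean-pools the outputs above over the time axis (dim=1), reusing the `o` from the script.

# Mean-pool over the time axis (dim=1) to collapse [1, 212, D] into [1, D]
pooled_hidden = o.last_hidden_state.mean(dim=1)   # shape: [1, 1024]
pooled_features = o.extract_features.mean(dim=1)  # shape: [1, 512]

# Drop the batch dimension to get a single 1-D embedding per file
embedding = pooled_hidden.squeeze(0)              # shape: [1024]
print(embedding.shape)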