I am trying to get the embeddings from pre-trained wav2vec2 models (e.g., from jonatasgrosman/wav2vec2-large-xlsr-53-german) using my own dataset.
My aim is to use these features for a downstream task (not specifically speech recognition). Specifically, since the dataset is relatively small, I plan to train an SVM on these embeddings for the final classification.
So far I have tried this:
from transformers import Wav2Vec2Processor, Wav2Vec2Model

model_name = "facebook/wav2vec2-large-xlsr-53-german"
feature_extractor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)

# train_dataset is my own dataset; "speech" holds the raw waveforms
input_values = feature_extractor(
    train_dataset[:10]["speech"],
    return_tensors="pt",
    padding=True,
    feature_size=1,
    sampling_rate=16000,
).input_values
Then, I am not sure whether the embeddings here correspond to the sequence of last_hidden_states:
hidden_states = model(input_values).last_hidden_state
or to the sequence of features of the last conv layer of the model:
features_last_cnn_layer = model(input_values).extract_features
Also, is this the correct way to extract features from a pre-trained model?
How can one get embeddings from a specific layer?
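To make the last question concrete, here is a minimal sketch of what I have in mind, not a confirmed recipe. It relies on the `output_hidden_states=True` argument of the model's forward pass, which returns one tensor per transformer layer plus the input to the first layer. To keep the sketch runnable without downloading anything, it uses a tiny, randomly initialized `Wav2Vec2Config` and random audio; these sizes, the layer index, and the mean pooling over time are all assumptions chosen for illustration. With the real checkpoint, the model would come from `Wav2Vec2Model.from_pretrained(model_name)` instead.

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2Model

# Hypothetical tiny configuration (illustration only, NOT the real XLSR-53 sizes)
config = Wav2Vec2Config(
    hidden_size=32,
    num_hidden_layers=3,
    num_attention_heads=2,
    intermediate_size=64,
    conv_dim=(32, 32),
    conv_stride=(5, 2),
    conv_kernel=(10, 3),
    num_conv_pos_embeddings=16,
    num_conv_pos_embedding_groups=2,
)
model = Wav2Vec2Model(config)  # random weights; use from_pretrained(...) in practice
model.eval()  # disable dropout and time masking

# Stand-in for two 1-second clips at 16 kHz (random noise, not real speech)
input_values = torch.randn(2, 16000)

with torch.no_grad():
    outputs = model(input_values, output_hidden_states=True)

# hidden_states is a tuple of length num_hidden_layers + 1:
# index 0 is the input to the first transformer layer, index i the output of layer i
layer_index = 2  # arbitrary choice for illustration
layer_output = outputs.hidden_states[layer_index]  # (batch, frames, hidden_size)

# One fixed-size vector per clip, e.g. as SVM input (assumed pooling strategy).
# With padded batches, padded frames should be masked out before averaging.
utterance_embeddings = layer_output.mean(dim=1)  # (batch, hidden_size)
```

The resulting `utterance_embeddings` could then be converted with `.numpy()` and passed to a classifier; whether mean pooling or a particular layer works best for a given task is something I would have to check empirically.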