Getting embeddings from wav2vec2 models

Hi there!

I am trying to get the embeddings from pre-trained wav2vec2 models (e.g., from jonatasgrosman/wav2vec2-large-xlsr-53-german) using my own dataset.

My aim is to use these features for a downstream task (not specifically speech recognition). Specifically, since my dataset is relatively small, I would train an SVM on these embeddings for the final classification.

So far I have tried this:

from transformers import Wav2Vec2Processor, Wav2Vec2Model

model_name = "facebook/wav2vec2-large-xlsr-53-german"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)

# feature_size is fixed by the checkpoint's config, so only sampling_rate is passed here
input_values = processor(train_dataset[:10]["speech"], return_tensors="pt",
                         padding=True, sampling_rate=16000).input_values

Then, I am not sure whether the embeddings here correspond to the sequence of last_hidden_states:

hidden_states = model(input_values).last_hidden_state

or to the sequence of features of the last conv layer of the model:

features_last_cnn_layer = model(input_values).extract_features
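For context, this is how I was planning to collapse the variable-length hidden states into fixed-size vectors for the SVM (simple mean pooling over the time axis; a random tensor stands in for the real model output here, and the shapes are just an example):

```python
import torch

# stand-in for model(input_values).last_hidden_state: (batch, time, hidden)
last_hidden_state = torch.randn(10, 49, 1024)

# average over the time axis -> one fixed-size vector per utterance
embeddings = last_hidden_state.mean(dim=1)  # shape: (10, 1024)

svm_features = embeddings.numpy()  # could then be passed to sklearn's SVC.fit
```

I'm not sure mean pooling is the recommended way to summarize the sequence, though.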

Also, is this the correct way to extract features from a pre-trained model?
And how can one get the embeddings from a specific intermediate layer?
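For the last question, I believe one can pass output_hidden_states=True to the forward call and index into the returned tuple. A minimal sketch (using a tiny randomly-initialized config instead of the pretrained checkpoint, just so it runs quickly; with the real model one would use Wav2Vec2Model.from_pretrained(model_name) instead):

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2Model

# tiny random config as a stand-in for the pretrained checkpoint
config = Wav2Vec2Config(
    hidden_size=32,
    num_hidden_layers=4,
    num_attention_heads=2,
    intermediate_size=64,
    num_feat_extract_layers=2,
    conv_dim=(32, 32),
    conv_stride=(4, 4),
    conv_kernel=(8, 8),
)
model = Wav2Vec2Model(config)
model.eval()  # disable spec-augment time masking

waveform = torch.randn(1, 16000)  # 1 second of fake 16 kHz audio

with torch.no_grad():
    outputs = model(waveform, output_hidden_states=True)

# hidden_states is a tuple with num_hidden_layers + 1 entries:
# index 0 is the input to the transformer, later indices are the
# outputs of each successive transformer layer
all_layers = outputs.hidden_states
layer_2_embeddings = all_layers[2]  # embeddings after transformer layer 2
```

Is indexing outputs.hidden_states like this the intended way to get a specific layer?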