Using Wav2Vec in speech classification/regression problems

Hi Wav2Vec enthusiasts,

I created a script for using Wav2Vec 2.0 in speech classification/regression problems. I tested the model on Persian and Greek and got strong results, even better than those reported in the papers.

I’m sharing this script/model so you can use it in your research. Please let me know your results and feedback.

Repo: GitHub - m3hrdadfi/soxan: Wav2Vec2 for speech recognition and classification


Hello @m3hrdadfi,

Great work. I see that you created a script that decides whether to use regression or classification by looking at the “num_labels” extracted from the CSV files.

I am trying to estimate some neurological scores from sound for Parkinson’s disease patients.

I am going to try to build a six-dimensional regression model: I will give the wav2vec model a WAV file together with six floating-point labels (an array of six elements) and want the model to predict these labels.

The question is how can I prepare the CSV files and feed the model?

Do I also need to make any changes to the existing model? You have added a classification layer on top of the model that takes two parameters, config.hidden_size and config.num_labels. I think I have to change these parameters, right?

Could you give some help please?


Hi @darkcurrent,

Thanks! If you want to use the script, some parts need to change, but I’ll guide you through the notebook. If it helps you, please contribute to improving the code for general use (repo).

Suppose you have a schema like this, and the emotion is a list of floats/integers.

    features: ['path', 'emotion'],
    num_rows: xxx

path: "/to/path/idk.wav",
emotion: [1.1, 2.1, 2.4, ..., 1.5]
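On the CSV question, here is a minimal sketch of one way to get from a flat CSV file to this schema. It assumes one column per score; the column names `score_1` to `score_6`, the file paths, and the helper `load_rows` are made up for illustration and are not part of the repo.

```python
import csv
import io

# Hypothetical CSV layout: one row per recording, one column per score.
CSV_TEXT = """path,score_1,score_2,score_3,score_4,score_5,score_6
/to/path/a.wav,1.1,2.1,2.4,0.3,0.9,1.5
/to/path/b.wav,0.7,1.8,2.0,0.1,1.2,1.4
"""

def load_rows(handle, label_prefix="score_"):
    """Collect the score columns of each row into one list of floats."""
    reader = csv.DictReader(handle)
    # Keep the header order so the six targets always line up the same way.
    label_columns = [name for name in reader.fieldnames if name.startswith(label_prefix)]
    rows = []
    for record in reader:
        rows.append({
            "path": record["path"],
            "emotion": [float(record[name]) for name in label_columns],
        })
    return rows

rows = load_rows(io.StringIO(CSV_TEXT))
print(rows[0])
```

Each row then has a `path` string and an `emotion` list of six floats, which matches the schema above.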

First of all, we need to change the label information:

label_list = train_dataset.unique(output_column)
label_list.sort()  # Let's sort it for determinism
num_labels = len(label_list)
print(f"A classification problem with {num_labels} classes: {label_list}")

to this:

num_labels = len(train_dataset[0][output_column])
label_list = list(range(num_labels))
print(f"A regression problem with {num_labels} items: {label_list}")
is_regression = True
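Since `num_labels` is read from the first example only, it may be worth checking that every row carries the same number of targets before training. A small sketch, assuming the rows can be iterated as plain dicts; `infer_num_labels` is a hypothetical helper, not something from the notebook:

```python
def infer_num_labels(rows, output_column):
    """Return the common label length, or raise if the rows disagree."""
    lengths = {len(row[output_column]) for row in rows}
    if len(lengths) != 1:
        raise ValueError(f"Expected one label length, found: {sorted(lengths)}")
    return lengths.pop()

# Toy rows following the schema above (values are made up).
example_rows = [
    {"path": "/to/path/a.wav", "emotion": [1.1, 2.1, 2.4, 0.3, 0.9, 1.5]},
    {"path": "/to/path/b.wav", "emotion": [0.7, 1.8, 2.0, 0.1, 1.2, 1.4]},
]

num_labels = infer_num_labels(example_rows, "emotion")
label_list = list(range(num_labels))
print(f"A regression problem with {num_labels} items: {label_list}")
```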

Then, add the problem type to the config and adjust the preprocessing step and the label type in the collator fn.

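# config
config = AutoConfig.from_pretrained(
    model_name_or_path,  # checkpoint name defined earlier in the notebook (variable name assumed)
    num_labels=num_labels,
    label2id={label: i for i, label in enumerate(label_list)},
    id2label={i: label for i, label in enumerate(label_list)},
    problem_type="regression",  # the problem type mentioned above
)
setattr(config, 'pooling_mode', pooling_mode)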
# preprocess
def speech_file_to_array_fn(path):
    speech_array, sampling_rate = torchaudio.load(path)
    resampler = torchaudio.transforms.Resample(sampling_rate, target_sampling_rate)
    speech = resampler(speech_array).squeeze().numpy()
    return speech

def preprocess_function(examples):
    speech_list = [speech_file_to_array_fn(path) for path in examples[input_column]]
    target_list = [label for label in examples[output_column]] # Do any preprocessing on your float/integer data

    result = processor(speech_list, sampling_rate=target_sampling_rate)
    result["labels"] = list(target_list)

    return result
# collator
from dataclasses import dataclass
from typing import Dict, List, Optional, Union
import torch

import transformers
from transformers import Wav2Vec2Processor

@dataclass
class DataCollatorCTCWithPadding:
    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True
    max_length: Optional[int] = None
    max_length_labels: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None
    pad_to_multiple_of_labels: Optional[int] = None

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_values": feature["input_values"]} for feature in features]
        label_features = [feature["labels"] for feature in features]

        # d_type = torch.long if isinstance(label_features[0], int) else torch.float
        d_type = torch.float

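        batch = self.processor.pad(
            input_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )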

        batch["labels"] = torch.tensor(label_features, dtype=d_type)

        return batch
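
To make the collator's job concrete, here is a rough NumPy-only sketch of what it produces for a regression batch: inputs padded to the longest clip, plus a float label array of shape (batch, num_labels). It only illustrates the idea and is not how `processor.pad` is implemented; `collate_sketch` and the toy values are assumptions.

```python
import numpy as np

def collate_sketch(features, pad_value=0.0):
    """Pad variable-length inputs and stack the float labels."""
    lengths = [len(f["input_values"]) for f in features]
    batch_size, max_len = len(features), max(lengths)

    input_values = np.full((batch_size, max_len), pad_value, dtype=np.float32)
    attention_mask = np.zeros((batch_size, max_len), dtype=np.int64)
    for row, (feature, n) in enumerate(zip(features, lengths)):
        input_values[row, :n] = feature["input_values"]
        attention_mask[row, :n] = 1  # 1 marks real samples, 0 marks padding

    # Regression targets are stored as floats, one row per example.
    labels = np.asarray([f["labels"] for f in features], dtype=np.float32)
    return {"input_values": input_values, "attention_mask": attention_mask, "labels": labels}

toy_features = [
    {"input_values": [0.1, -0.2, 0.3], "labels": [1.1, 2.1]},
    {"input_values": [0.5, 0.4, -0.1, 0.0, 0.2], "labels": [0.7, 1.8]},
]
batch = collate_sketch(toy_features)
print(batch["input_values"].shape, batch["labels"].shape)
```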

If I continue walking through the process, this will be a long reply :thinking:, so I’ll attach the modified notebook to make things clearer.

Google Colab Notebook: Regression Example

Have you tried emotion classification along with emotion detection?

Thank you for your great effort @m3hrdadfi. And sorry for the delayed answer.

Have you tried to identify the language of a speech signal using your procedure?
If so, I am interested: language identification from speech signals would be a good application of the Common Voice database in the first place.

Hi @m3hrdadfi,

I am experimenting with your scripts for emotion classification/regression. Specifically, I am trying to adjust them for single-variable regression. Unfortunately, I have encountered a couple of problems.

  1. During the training I get:
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/ UserWarning: Using a target size (torch.Size([4])) that is different to the input size (torch.Size([4, 1])). This will likely lead to incorrect results due to broadcasting. Please ensure they have the same size.
  return F.mse_loss(input, target, reduction=self.reduction)
The following columns in the evaluation set  don't have a corresponding argument in `Wav2Vec2ForSpeechClassification.forward` and have been ignored: path, values, name.
  2. When I do the evaluation, all my predictions are the same value.

Would you be willing to help, please?

Thank you in advance!

Have you checked my previous multi-regression example? You must set problem_type="regression" in your configuration; otherwise the model probably won’t be treated as a regression problem.

I would understand your situation better if you shared your code as a notebook.

Yes, I have checked your multi-regression example and have set problem_type="regression". You can check the code: Regression - Emotion recognition in Greek speech using Wav2Vec2.ipynb - Colaboratory. Thank you!

Is that the actual data you’re going to use with this model, or just a way to demonstrate your dataset’s behavior? I ask because you assigned labels drawn from a random uniform distribution, and that probably won’t produce meaningful outputs for that specific problem (SER-AESDD).

BTW, the code procedure sounds acceptable to me.

I found wav2vec interesting. Unfortunately, I am new to HuggingFace and wav2vec2, so I wanted to get familiar with them first. Therefore, I assigned random values to the files to play with the code and test regression. I understand that the outputs will not be reasonable, but I do not think I should get the same prediction for all the files. In addition, might the broadcasting problem be related to the emotion classification setup? I do not know; I am a beginner, so I thought you could help because I do not know how to fix it.
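
On the warning itself: the message says the model output has shape [4, 1] while the labels have shape [4], so the MSE is computed over a broadcast [4, 4] grid of all prediction/label pairs rather than element-wise. The sketch below reproduces that arithmetic with NumPy, whose broadcasting rules PyTorch follows; the numbers are made up.

```python
import numpy as np

# Hypothetical batch of 4 examples with a single regression target each.
preds = np.array([[0.2], [0.9], [0.4], [0.7]])  # model output, shape (4, 1)
targets = np.array([0.0, 1.0, 0.5, 0.5])        # labels, shape (4,)

# Broadcasting (4, 1) against (4,) yields a (4, 4) matrix of all pairwise
# differences, which is what the UserWarning is pointing at.
pairwise = preds - targets
broadcast_mse = float(np.mean(pairwise ** 2))

# Giving the targets the same (4, 1) shape restores the element-wise loss.
aligned_mse = float(np.mean((preds - targets.reshape(-1, 1)) ** 2))

# Under the broadcast loss, a constant prediction equal to the target mean
# scores better than the predictions above, although it ignores the input.
constant = np.full((4, 1), targets.mean())
constant_broadcast_mse = float(np.mean((constant - targets) ** 2))

print(pairwise.shape, broadcast_mse, aligned_mse, constant_broadcast_mse)
```

With the broadcast loss, each prediction is compared against every label in the batch, so the best the model can do is output roughly the batch mean everywhere. That may be what is behind the identical predictions, though I have not run the notebook. Making the two shapes agree before the loss is computed (for example, labels of shape [batch, 1]) should remove the warning.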

Hello @m3hrdadfi!

First of all, thanks for your amazing work, it’s really cool. I started investigating speech emotion recognition last year, and I was working on a fine-tuned version of this wav2vec2 model when I saw your work, which I found very interesting (some of your choices were very helpful indeed :clap:).

The truth is that I achieved very good and promising results for an English version, and while I was sharing the model on the HF hub, I found that the way others can approach or even use these speech classification models is not very handy and generalizes poorly. Do you know if there is something like a Wav2Vec2ForSpeechClassification class, or some standardization, under development or in scope (i.e., similar to BertForSequenceClassification)? I think it would be very useful for this type of audio classification task, given the capabilities of Wav2Vec 2.0.

Anyway, congratulations again for your awesome work :tada:

Hello, I have some problems with the code. Could anyone help?
train_dataset =
eval_dataset =
I get this console output:
83 if status.IsInvalid():
---> 84 raise ArrowInvalid(message)
85 elif status.IsIOError():
86 # Note: OSError constructor is

ArrowInvalid: Can only convert 1-dimensional array values