Trying to build scream detection using a pretrained model

Context: There was a horrible rape case in Kolkata.

I have been wondering why a “smart” phone needs elaborate manual steps to trigger SOS when it already has enough inputs to detect panic: mic, camera, GPS, gyroscope, etc.

I found this model (padmalcom/wav2vec2-large-nonverbalvocalization-classification) that promises to detect screams. But when I run it on a test scream recording, I get a different result on every run.

Here is the script I’m using:

import torch
import librosa
from scipy.stats import zscore
from transformers import Wav2Vec2ForSequenceClassification

model_name = "padmalcom/wav2vec2-large-nonverbalvocalization-classification"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
model.eval()

# wav2vec2 models are pretrained on 16 kHz audio, so load at 16 kHz
audio_path = "scream_test.wav"
audio, sample_rate = librosa.load(audio_path, sr=16000)

# normalize to zero mean / unit variance
audio = zscore(audio)

# my attempt at making the runs reproducible
torch.manual_seed(42)

inputs = torch.tensor(audio, dtype=torch.float32).unsqueeze(0)  # add a batch dimension

with torch.no_grad():
    outputs = model(inputs)
predicted_class_index = torch.argmax(outputs.logits, dim=1).item()
labels = model.config.id2label

print(labels[predicted_class_index])
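
For reference, the standard preprocessing path for wav2vec2 models goes through the checkpoint's own feature extractor rather than manual z-scoring. Here is a minimal sketch of that variant, assuming the repo actually ships a preprocessor config (I have not confirmed it does):

import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

model_name = "padmalcom/wav2vec2-large-nonverbalvocalization-classification"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
model.eval()

# the feature extractor handles normalization; the audio still has to be 16 kHz
audio, _ = librosa.load("scream_test.wav", sr=16000)
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])

With do_normalize set (the default), the feature extractor applies the same zero-mean/unit-variance normalization internally, so this should be equivalent to the z-score step above.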

I also see this warning before the output:

Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at padmalcom/wav2vec2-large-nonverbalvocalization-classification and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
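
That warning reads like the classification head is simply missing from the checkpoint and gets re-initialized with random weights on every load, which would explain the changing predictions. A quick way to confirm (a sketch using transformers' output_loading_info flag):

from transformers import Wav2Vec2ForSequenceClassification

model_name = "padmalcom/wav2vec2-large-nonverbalvocalization-classification"

# from_pretrained can report exactly which weights were NOT found in the checkpoint
model, loading_info = Wav2Vec2ForSequenceClassification.from_pretrained(
    model_name, output_loading_info=True
)
print(loading_info["missing_keys"])  # expect classifier.* / projector.* here if the head is missing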

I’m new to this, so what am I doing wrong?

I also tried the simpler pipeline API:

from transformers import pipeline

model_name = "padmalcom/wav2vec2-large-nonverbalvocalization-classification"
# the pipeline handles loading, resampling, and preprocessing internally
classifier = pipeline("audio-classification", model=model_name)

print(classifier("scream_test.wav"))

This still fails with the same warning and gives a different prediction each run.
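
One detail I noticed while re-reading my first script: torch.manual_seed(42) runs after the model has already been loaded, so it cannot affect the random initialization the warning talks about. Seeding before loading should at least make the runs repeatable if random init is the culprit (a sketch to confirm the diagnosis, not a fix):

import torch
from transformers import pipeline

# seed BEFORE the model is loaded, so any randomly initialized weights are the same each run
torch.manual_seed(42)

model_name = "padmalcom/wav2vec2-large-nonverbalvocalization-classification"
classifier = pipeline("audio-classification", model=model_name)

# if the head is randomly initialized, this should now print the same result on every run
print(classifier("scream_test.wav"))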