Context: There was a horrible rape case in Kolkata.
I have been wondering why a “smart” phone needs elaborate manual steps to trigger SOS when it already has enough inputs to detect panic: microphone, camera, GPS, gyroscope, etc.
I found a model (padmalcom/wav2vec2-large-nonverbalvocalization-classification) that promises to detect screaming. But when I run it on a test clip of a scream, I get a different result on every run.
Here is the script I’m using:
import torch
import librosa
from scipy.stats import zscore
from transformers import Wav2Vec2ForSequenceClassification

# Load the non-verbal vocalization classifier
model_name = "padmalcom/wav2vec2-large-nonverbalvocalization-classification"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)

# Load the test recording and z-score normalize it
audio_path = "scream_test.wav"
audio, sample_rate = librosa.load(audio_path, sr=48000)
audio = zscore(audio)

# Fix the torch seed, classify the clip, and print the predicted label
torch.manual_seed(42)
inputs = torch.tensor(audio).unsqueeze(0)
outputs = model(inputs)
predicted_class_index = torch.argmax(outputs.logits, dim=1).item()
labels = model.config.id2label
print(labels[predicted_class_index])
I also see this warning before the output:
Some weights of Wav2Vec2ForSequenceClassification were not initialized from the model checkpoint at padmalcom/wav2vec2-large-nonverbalvocalization-classification and are newly initialized: ['classifier.bias', 'classifier.weight', 'projector.bias', 'projector.weight', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
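In case it is useful, this is roughly how I am reproducing the randomness: a small loop that reloads the checkpoint and classifies the same clip, with the same file and preprocessing as in the script above (both of those are my own choices, not from the model card). Each fresh from_pretrained call prints a different label for me.

import torch
import librosa
from scipy.stats import zscore
from transformers import Wav2Vec2ForSequenceClassification

model_name = "padmalcom/wav2vec2-large-nonverbalvocalization-classification"

# Same clip and preprocessing as in the script above
audio, sample_rate = librosa.load("scream_test.wav", sr=48000)
inputs = torch.tensor(zscore(audio)).unsqueeze(0)

for i in range(3):
    # Reload the checkpoint from scratch, then classify the same clip
    model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
    with torch.no_grad():
        logits = model(inputs).logits
    predicted = torch.argmax(logits, dim=1).item()
    # The printed label differs between iterations for me
    print(i, model.config.id2label[predicted])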
I’m new here, what am I doing wrong?