Fine-tuned Wav2Vec2 does not recognise spoken words

Hello,

I’m trying to develop a hot-word detector using Hugging Face and PyTorch. So far I’ve been studying the documentation to build an understanding of the ecosystem.

I’m running the code on my MacBook, using MPS for acceleration. I’ve been following the audio-classification notebook on Google Colab (the link is in fine_tuning.py below) to get an understanding of the process.

Goals
I’d like to fine-tune wav2vec2 on the superb/ks dataset, with the addition of a single extra keyword (my hot-word, which is something like “hey assistant”).

The application is expected to recognise the words being spoken.
I have created a simple app that listens on the mic and passes the captured audio to the model; I tested it with an online pre-trained model (MIT/ast-finetuned-speech-commands-v2) to confirm the capture pipeline works, roughly as sketched below.
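
For reference, the sanity check with the pre-trained model looked roughly like this (a minimal sketch; "test_clip.wav" is a placeholder for any short 16 kHz recording):

# sanity_check.py (sketch)

from transformers import pipeline

# Known-good keyword-spotting model, used only to validate the audio capture path
classifier = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-speech-commands-v2",
)

print(classifier("test_clip.wav", top_k=3))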

Problem
When I run the application using my custom fine-tuned model, the output continuously prints “_unknown_”.

I just don’t understand what’s wrong here.

What I’ve tried

Below are the 3 files I have created for 1) dataset creation (for my custom wake word), 2) fine-tuning/training, and 3) running the inference.

My dataset is prepared only for the wake word, which I will later combine with superb/ks.

# prepare_dataset.py

import os
from datasets import Dataset, Audio, ClassLabel

def load_wake_word_files(directory):
    """Generate a dataset compatible with superb/ks.
    """
    data = []
    for filename in os.listdir(directory):
        if filename.endswith(".wav"):
            file = os.path.join(directory, filename)
            data.append({"file": file, "audio": file, "label": "hey_assistant"})  # string label; cast to ClassLabel below
    
    print(f"{len(data)} files found")
    return data

# Path to your wake words directory
wake_word_dir = "/Users/jah/code/assistant/py/hotword/data/hey_assistant/"

# Create the dataset for wake words
wake_word_data = load_wake_word_files(wake_word_dir)
dataset = Dataset.from_dict({
    "file": [item["file"] for item in wake_word_data],
    "audio": [item["audio"] for item in wake_word_data],
    "label": [item["label"] for item in wake_word_data],
})
# Cast to Audio type, opening each file
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
dataset = dataset.cast_column("label", ClassLabel(names=["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go", "_silence_", "_unknown_", "hey_assistant"]))

# 80-20 train-test split
split_dataset = dataset.train_test_split(test_size=0.2)

# Save the dataset
split_dataset.save_to_disk("/Users/jah/code/assistant/py/hotword/data/dataset_wave2vec2/")

print("Wake word dataset created and saved to disk.")

Here I combine the two datasets, extending the superb labels to include my custom wake word.

# fine_tuning.py
"""
Reference doc: https://colab.research.google.com/github/huggingface/notebooks/blob/main/examples/audio_classification.ipynb#scrollTo=UuyXDtQqNUZW
"""
from datasets import load_dataset, load_metric, load_from_disk, ClassLabel, concatenate_datasets
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification, TrainingArguments, Trainer
import numpy as np
import torch

# Prefer MPS on Apple Silicon; otherwise fall back to CUDA, then CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.backends.mps.is_available() and torch.backends.mps.is_built():
    device = torch.device("mps")

# imports used only for manually inspecting samples (notebook-style)
import random
from IPython.display import Audio, display

model_checkpoint = "facebook/wav2vec2-base"
batch_size = 32

superb_train = load_dataset("superb", "ks", split="train")
superb_testing = load_dataset("superb", "ks", split="test")
validation_dataset = load_dataset("superb", "ks", split="validation")
metric = load_metric("accuracy")  # note: deprecated in newer datasets releases; evaluate.load("accuracy") is the replacement

dataset_path = "/Users/jah/code/assistant/py/hotword/data/dataset_wave2vec2/"
wake_word_dataset = load_from_disk(dataset_path)

# Cast the datasets

# training:
casted_superb_train_features = superb_train.features.copy()
casted_superb_train_features["label"] = ClassLabel(names=["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go", "_silence_", "_unknown_", "hey_assistant"])
casted_superb_train = superb_train.cast(casted_superb_train_features)

train_combined_dataset = concatenate_datasets([casted_superb_train, wake_word_dataset["train"]])

# testing:
casted_superb_testing_features = superb_testing.features.copy()
casted_superb_testing_features["label"] = ClassLabel(names=["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go", "_silence_", "_unknown_", "hey_assistant"])
casted_superb_testing = superb_testing.cast(casted_superb_testing_features)

testing_combined_dataset = concatenate_datasets([casted_superb_testing, wake_word_dataset["test"]])

labels = train_combined_dataset.features["label"].names

# Build label <-> id lookup tables for the model config (ids stored as strings, as in the reference notebook)
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

feature_extractor = AutoFeatureExtractor.from_pretrained(model_checkpoint)

max_duration = 1.0  # seconds; the feature extractor truncates longer clips to this length

def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays, 
        sampling_rate=feature_extractor.sampling_rate, 
        max_length=int(feature_extractor.sampling_rate * max_duration), 
        truncation=True, 
    )
    return inputs

train_encoded_dataset = train_combined_dataset.map(preprocess_function, remove_columns=["audio", "file"], batched=True)
validation_encoded_dataset = validation_dataset.map(preprocess_function, remove_columns=["audio", "file"], batched=True)

num_labels = len(id2label)
model = AutoModelForAudioClassification.from_pretrained(
    model_checkpoint, 
    num_labels=num_labels,
    label2id=label2id,
    id2label=id2label,
)

model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    output_dir=f'/Users/jah/code/assistant/py/hotword/data/results/{model_name}-finetuned-ks',
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=batch_size,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    warmup_ratio=0.1,
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    push_to_hub=False,
    use_mps_device=True, # deprecated, added for consistency
    use_cpu=False,
)

def compute_metrics(eval_pred):
    """Computes accuracy on a batch of predictions"""
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return metric.compute(predictions=predictions, references=eval_pred.label_ids)

trainer = Trainer(
    model,
    args,
    train_dataset=train_encoded_dataset,
    eval_dataset=validation_encoded_dataset,
    tokenizer=feature_extractor,
    compute_metrics=compute_metrics
)

print("Training device:", trainer.args.device)

# this will use the pytorch device set above
trainer.train()
trainer.evaluate()

Finally, this is the app that should be using the trained model:

# app.py

from transformers import pipeline
from transformers.pipelines.audio_utils import ffmpeg_microphone_live
import torch

# Simplified device assignment
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
if torch.backends.mps.is_available() and torch.backends.mps.is_built():
    device = torch.device("mps")

print(f"Device: {device}")

# Load classifier
classifier = pipeline(
    "audio-classification", 
    model="/Users/jah/code/assistant/py/hotword/data/results/wav2vec2-base-finetuned-ks/checkpoint-2150", 
    device=device, 
    local_files_only=True
)

print(classifier.model.config.id2label)

def launch_fn(
    wake_word="hey_assistant",
    prob_threshold=0.5,
    chunk_length_s=2.0,
    stream_chunk_s=1.0,
    debug=False,
):
    if wake_word not in classifier.model.config.label2id.keys():
        raise ValueError(
            f"Wake word {wake_word} not in set of valid class labels, pick a wake word in the set {classifier.model.config.label2id.keys()}."
        )

    sampling_rate = classifier.feature_extractor.sampling_rate
    mic = ffmpeg_microphone_live(
        sampling_rate=sampling_rate,
        chunk_length_s=chunk_length_s,
        stream_chunk_s=stream_chunk_s,
    )

    print("Listening for wake word...")
    for prediction in classifier(mic):
        prediction = prediction[0]
        if debug:
            print(f"Prediction: {prediction}")

        if prediction["label"] == wake_word and prediction["score"] > prob_threshold:
            print("Wake word detected!")
            return True
        else:
            print("Wake word not detected, continuing...")

if __name__ == "__main__":
    launch_fn(debug=True)

Unfortunately it only prints output like:

Prediction: {'score': 0.13256900012493134, 'label': '_unknown_'}
Wake word not detected, continuing...

I checked the labels and they are all present (see the sketch below); however, none of the words are recognised, not even the ones coming from superb/ks.
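
For completeness, this is roughly how I checked (a minimal sketch; sample_0.wav is a placeholder for one of my training clips):

# check_labels.py (sketch)

from transformers import pipeline

classifier = pipeline(
    "audio-classification",
    model="/Users/jah/code/assistant/py/hotword/data/results/wav2vec2-base-finetuned-ks/checkpoint-2150",
    local_files_only=True,
)

# All 13 labels, including hey_assistant, show up here
print(classifier.model.config.id2label)

# Classify a training clip directly, bypassing the microphone entirely
print(classifier("/Users/jah/code/assistant/py/hotword/data/hey_assistant/sample_0.wav", top_k=5))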

Custom dataset
The custom dataset has around 6k recordings of the wake word, with clips ranging from 1 to 3 seconds.
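
If I read preprocess_function correctly, anything longer than max_duration = 1.0 second is truncated during training, so most of these clips lose their tail. Here is a small diagnostic sketch to inspect the durations (assuming the soundfile package is installed; directory as in prepare_dataset.py):

# check_durations.py (sketch)

import os
import soundfile as sf

wake_word_dir = "/Users/jah/code/assistant/py/hotword/data/hey_assistant/"

# Read each clip's duration from the header, without loading the audio itself
durations = [
    sf.info(os.path.join(wake_word_dir, f)).duration
    for f in os.listdir(wake_word_dir)
    if f.endswith(".wav")
]

print(f"{len(durations)} clips, "
      f"min {min(durations):.2f}s, max {max(durations):.2f}s, "
      f"mean {sum(durations) / len(durations):.2f}s")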

Wrapping up
Any help would be greatly appreciated. It took me over 7 hours to train the model, and I’ve spent weeks on this already. I haven’t been able to locate a notebook that goes end to end, so I can’t get to a working solution, which is depressing :frowning:

The output of the training ends with:

{'loss': 2.2602, 'grad_norm': 201870911995904.0, 'learning_rate': 1.5503875968992249e-07, 'epoch': 4.97}                                                                                                                                                      
{'loss': 2.2482, 'grad_norm': 281655633772544.0, 'learning_rate': 0.0, 'epoch': 5.0}                                                                                                                                                                          
{'eval_loss': 2.2558882236480713, 'eval_accuracy': 0.6209179170344219, 'eval_runtime': 129.4695, 'eval_samples_per_second': 52.507, 'eval_steps_per_second': 1.645, 'epoch': 5.0}                                                                             
{'train_runtime': 20807.897, 'train_samples_per_second': 13.232, 'train_steps_per_second': 0.103, 'train_loss': 2.2983862703900004, 'epoch': 5.0}                                                                                                             

I noticed that grad_norm fluctuated a lot (and reached enormous values); I’m not sure whether this signals an issue with the training itself.
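
For reference, the per-step loss and grad_norm can be read back from the trainer_state.json that Trainer writes into each checkpoint (a minimal sketch, reusing the checkpoint path from app.py):

# inspect_training.py (sketch)

import json

state_path = (
    "/Users/jah/code/assistant/py/hotword/data/results/"
    "wav2vec2-base-finetuned-ks/checkpoint-2150/trainer_state.json"
)

with open(state_path) as f:
    state = json.load(f)

# Training log entries carry loss, grad_norm, learning_rate, and epoch
for entry in state["log_history"]:
    if "grad_norm" in entry:
        print(f"epoch {entry['epoch']:.2f}  loss {entry['loss']:.4f}  grad_norm {entry['grad_norm']:.3e}")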