I started looking a bit into Confidence Scores / Self-Training for Speech Recognition for models like Wav2Vec2 that make use of a language model via the pyctcdecode library.
pyctcdecode returns an `lm_score`, which can be seen as the fused score of the acoustic model (Wav2Vec2) and the language model (kenLM). This score is the sum of all per-word fused lm_scores, so it seems reasonable to normalize the output by the number of words. Also see some questions here:
- confidence scores output from the LM · Issue #57 · kensho-technologies/pyctcdecode · GitHub
- Question about naming of `lm_score` parameter in `decode_logits` · Issue #63 · kensho-technologies/pyctcdecode · GitHub
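For illustration, here is a minimal sketch of that length normalization at the pyctcdecode level. The decoder construction and the layout of the returned beam tuple follow my reading of the pyctcdecode README; `vocab`, `logits` and the kenLM path are placeholders:

```python
from pyctcdecode import build_ctcdecoder

# placeholders: `vocab` is the acoustic model's output vocabulary (in logit order),
# "4gram.arpa" a kenLM model, `logits` a (time, vocab_size) numpy array of CTC logits
decoder = build_ctcdecoder(labels=vocab, kenlm_model_path="4gram.arpa")

# best beam: (text, lm_state, word_frames, logit_score, lm_score)
text, _, _, logit_score, lm_score = decoder.decode_beams(logits)[0]

# length-normalize the fused score by the number of predicted words
confidence = lm_score / len(text.split())
```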
First, let’s create some Wav2Vec2 + n-gram models. We’ll simply add the official 4-gram of Librispeech to the new data2vec models to create the following models (see the sketch after this list for roughly how):
- patrickvonplaten/data2vec-audio-base-10m-4-gram
- patrickvonplaten/data2vec-audio-base-100h-4-gram
- patrickvonplaten/data2vec-audio-base-960h-4-gram
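For completeness, this is roughly how such a checkpoint can be assembled (a sketch of the usual n-gram recipe; the base checkpoint name and the local ARPA path are assumptions on my side):

```python
from pyctcdecode import build_ctcdecoder
from transformers import AutoProcessor, Wav2Vec2ProcessorWithLM

# plain (LM-less) processor of the base checkpoint
processor = AutoProcessor.from_pretrained("facebook/data2vec-audio-base-960h")

# build a pyctcdecode decoder from the tokenizer vocab and the official Librispeech 4-gram
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab = [tok for tok, _ in sorted(vocab_dict.items(), key=lambda item: item[1])]
decoder = build_ctcdecoder(labels=sorted_vocab, kenlm_model_path="4-gram.arpa")  # placeholder path

# wrap feature extractor, tokenizer and decoder into a processor with LM and save it
processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder,
)
processor_with_lm.save_pretrained("./data2vec-audio-base-960h-4-gram")
```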
Now, it’s quite easy to retrieve those `lm_scores` and to compute a confidence level this way:
- Import all necessary libraries and load the model and processor:
from transformers import AutoModelForCTC, AutoProcessor
from datasets import load_dataset
import datasets
import torch

# one of the checkpoints listed above, e.g. "patrickvonplaten/data2vec-audio-base-10m-4-gram"
model_id = "TODO: fill in"

model = AutoModelForCTC.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
- Load Librispeech dummy data:
num_samples = 4

dataset = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
samples = dataset[:num_samples]

audio_samples = [s["array"] for s in samples["audio"]]
# all dummy samples share the same sampling rate (16 kHz)
sampling_rate = set(s["sampling_rate"] for s in samples["audio"]).pop()
text_samples = samples["text"]
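The dummy data is already at 16 kHz; for other audio one could resample on the fly with `datasets` before slicing, e.g.:

```python
# only needed if the audio does not match the model's expected 16 kHz sampling rate
dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))
```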
- Predict the transcriptions with the model:
# process raw audio to input_values
inputs = processor(audio_samples, return_tensors="pt", sampling_rate=sampling_rate, padding=True)

# forward inputs through the model
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
- Retrieve the fused `lm_score` normalized by the number of predicted words:
output = processor.batch_decode(logits.numpy(), output_word_offsets=True)
# length-normalize the per-utterance lm_score by the number of predicted words
confidence_scores = [score / len(t.split(" ")) for score, t in zip(output.lm_score, output.text)]
- Define the confidence score as the length-normalized `lm_score` of the prediction and print it next to the reference:
for i in range(num_samples):
    print(20 * "=" + f"Output {i}" + 20 * "=")
    print(text_samples[i])
    print(f"{output.text[i]}: {confidence_scores[i]}")
    print("\n")
Cool, let’s run this on the new data2vec audio models:
- patrickvonplaten/data2vec-audio-base-10m-4-gram
- patrickvonplaten/data2vec-audio-base-100h-4-gram
- patrickvonplaten/data2vec-audio-base-960h-4-gram
patrickvonplaten/data2vec-audio-base-10m-4-gram

====================Output 0====================
MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
MISTER QUILTER IS THE APPOSELE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL: -2.9550299660242825
====================Output 1====================
NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER
NOR IS MISTER QUILTR'S MANNER LESS INTERESTING THAN HIS MATTER: -3.8471058156146243
====================Output 2====================
HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND
HE TELLS IS THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CRISMIIS AND ROST BEEF LOOMING BEFORE HIS SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND: -3.115683062281252
====================Output 3====================
HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LEIGHTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA
HE HAS GRAVED DOUBTS WHETHER SIR FREDERICK LATEN'S WORK IS RELY GREEK AFTER ALL AND CAN DESCOVER IN IT BUT LITTLE OF ROCKY ETHICA: -4.292775884726897
patrickvonplaten/data2vec-audio-base-100h-4-gram

====================Output 0====================
MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL: -1.0723093529710663
====================Output 1====================
NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER
NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER: -2.6140757339617786
====================Output 2====================
HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND
HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND: -1.1805021799946347
====================Output 3====================
HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LEIGHTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA
HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LAYTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA EH: -2.069009737832042
patrickvonplaten/data2vec-audio-base-960h-4-gram

====================Output 0====================
MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL: -1.0610139720694658
====================Output 1====================
NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER
NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER R: -3.11299682252419
====================Output 2====================
HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND
HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND: -1.147767963941466
====================Output 3====================
HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LEIGHTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA
HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LEIGHTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA: -1.870571726475313
Alright, this actually seems to make some sense! The 10m model consistently has the lowest scores, and one can generally say that the more correct the sentence, the better the score. The 960h model has the best scores for all but Output 1, for which the 100h model also gives a better prediction.
This already seems to work quite well, but it would need some more experiments.
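If it does hold up, the natural next step for self-training would be to keep only pseudo-labels above some confidence threshold. A minimal sketch on top of the variables from above (the threshold value is made up):

```python
# keep only utterances whose length-normalized lm_score exceeds a (to-be-tuned) threshold
confidence_threshold = -1.5  # placeholder value

pseudo_labeled = [
    {"audio": audio, "text": transcription}
    for audio, transcription, score in zip(audio_samples, output.text, confidence_scores)
    if score > confidence_threshold
]
# pseudo_labeled could then be mixed with the labeled data for another fine-tuning round
```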
There are a couple of questions I’m not sure about:
- Right now the average score per word is taken; would `min` or `max` over the per-word scores maybe be better (see the rough sketch below)? Also see: confidence scores output from the LM · Issue #57 · kensho-technologies/pyctcdecode · GitHub
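As a rough sketch of a per-word alternative: `batch_decode` does not directly expose the per-word fused lm_scores, so the snippet below falls back to a purely acoustic proxy (frame-level max softmax probability averaged over each word's offsets) and then takes the `min` over words; treating `end_offset` as exclusive is an assumption on my side.

```python
# per-word acoustic proxy confidence, aggregated with min instead of mean
probs = logits.softmax(dim=-1).max(dim=-1).values  # (batch, time): max probability per frame

min_word_confidences = []
for i, word_offsets in enumerate(output.word_offsets):
    word_scores = []
    for w in word_offsets:
        # treat end_offset as exclusive; fall back to a single frame for degenerate spans
        span = probs[i, w["start_offset"] : w["end_offset"]]
        word_scores.append(span.mean().item() if span.numel() > 0 else probs[i, w["start_offset"]].item())
    min_word_confidences.append(min(word_scores) if word_scores else 0.0)
```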