I started looking a bit into Confidence Scores / Self-Training for Speech Recognition for models like Wav2Vec2.
The most reasonable way of doing this seems to be on a per-word level.
With the new output_word_offsets=True
it’s quite easy to retrieve the logit scores of the predicted words. E.g. one could do the following:
- Import all necessary libraries and load the model and processor
from transformers import AutoModelForCTC, AutoProcessor
from datasets import load_dataset
import torch

model_id = "TODO: fill in"  # e.g. one of the data2vec checkpoints listed below

model = AutoModelForCTC.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
- Load LibriSpeech dummy data:
num_samples = 4
dataset = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
samples = dataset[:num_samples]
audio_samples = [s["array"] for s in samples["audio"]]
sampling_rate = set([s["sampling_rate"] for s in samples["audio"]]).pop()
text_samples = samples["text"]
- Predict transcription with model:
# process to input_values
inputs = processor(audio_samples, return_tensors="pt", sampling_rate=sampling_rate, padding=True)
# forward inputs to model
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
- Compute the probabilities (log-softmax here) of the predicted ids (argmax of the logits):
pred_ids = torch.argmax(logits, dim=-1)
scores = torch.nn.functional.log_softmax(logits, dim=-1)
# pick, for each frame, the log-probability of the predicted (argmax) token
pred_scores = scores.gather(-1, pred_ids.unsqueeze(-1))[:, :, 0]
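As a quick sanity check, the same per-frame scores can also be picked out with plain advanced indexing (just a sketch, the variable names are only illustrative):
# sanity check: select the log-probability of the argmax token per frame without gather
batch_idx = torch.arange(scores.shape[0]).unsqueeze(-1)    # (batch, 1)
time_idx = torch.arange(scores.shape[1]).unsqueeze(0)      # (1, time)
pred_scores_check = scores[batch_idx, time_idx, pred_ids]  # (batch, time)
assert torch.allclose(pred_scores, pred_scores_check)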
- Retrieve the per-word probability, normalized by word length:
output = processor.batch_decode(pred_ids, output_word_offsets=True)
# add confidence
def confidence_score(word_dict, index):
    probs = pred_scores[index, word_dict["start_offset"]: word_dict["end_offset"]]
    return round(torch.sum(probs).item() / len(probs), 4)
confidence_scores = []
for i in range(num_samples):
    confidence_scores.append({d["word"]: confidence_score(d, i) for d in output.word_offsets[i]})
- Define the confidence score of a transcription as the minimum word prob:
for i in range(num_samples):
    print(20 * "=" + f"Output {i}" + 20 * "=")
    print(text_samples[i])
    print(f"{' '.join(confidence_scores[i].keys())}: {min(confidence_scores[i].values())}")
    print("\n")
Cool, let’s run this on the new data2vec audio models (a small loop sketch follows the list below):
- facebook/data2vec-audio-base-10m
- facebook/data2vec-audio-base-100h
- facebook/data2vec-audio-base-960h
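Concretely, this just means repeating the steps above for each checkpoint, e.g. with a loop along these lines (a sketch reusing audio_samples, sampling_rate, text_samples, and num_samples from above):
checkpoints = [
    "facebook/data2vec-audio-base-10m",
    "facebook/data2vec-audio-base-100h",
    "facebook/data2vec-audio-base-960h",
]

for model_id in checkpoints:
    print(model_id)
    model = AutoModelForCTC.from_pretrained(model_id)
    processor = AutoProcessor.from_pretrained(model_id)

    # forward the audio through the current checkpoint
    inputs = processor(audio_samples, return_tensors="pt", sampling_rate=sampling_rate, padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

    pred_ids = torch.argmax(logits, dim=-1)
    pred_scores = torch.nn.functional.log_softmax(logits, dim=-1).gather(-1, pred_ids.unsqueeze(-1))[:, :, 0]

    # per-word, length-normalized log-probs and the minimum as sentence confidence
    output = processor.batch_decode(pred_ids, output_word_offsets=True)
    for i in range(num_samples):
        word_scores = {}
        for d in output.word_offsets[i]:
            probs = pred_scores[i, d["start_offset"]: d["end_offset"]]
            word_scores[d["word"]] = round(torch.sum(probs).item() / len(probs), 4)
        print(20 * "=" + f"Output {i}" + 20 * "=")
        print(text_samples[i])
        print(f"{' '.join(word_scores.keys())}: {min(word_scores.values())}")
        print("\n")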
One would expect the 960h model to be noticeably more confident than the 10m and 100h models. However, the outputs are as follows:
facebook/data2vec-audio-base-10m
====================Output 0====================
MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
MISTER QUILTER IS THE APPOSELE OF MIDL CLASES AND WHE ER GLAD TO WELCOME HIS GASPLE: -0.5873
====================Output 1====================
NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER
NOR IS MISTERE QUILTR'S MANER LES INTRESTING THAN HIS MATER: -0.4173
====================Output 2====================
HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND
HE TELES IS THAT AT THIS FESTIVE CESON OF THE YEAR WITH CRISMIIS AND ROST BEF LOOMING BEFOR SEIMILIYS DRAWN FROM EATING ITS RESALTS OCARE MOST REDHILY TO MIND: -0.0
====================Output 3====================
HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LEIGHTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA
HE HAS GREAVED DOUBTS WETHER SIR FREDRICK LATEN'S WORK IS RELY GRE AFTER ALL AND CAN DESCOVER IN IT BUT LITTLE OFE ROCKY ETHICA: -0.0006
facebook/data2vec-audio-base-100h
====================Output 0====================
MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
MISTER QUILTER IS THE APOSTLE OF MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL: -0.7656
====================Output 1====================
NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER
NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER: -0.5057
====================Output 2====================
HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND
HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE SIMILES DRAWN FROM EATING ITS RESULTS OCCUR MOST READILY TO MINE: -0.0
====================Output 3====================
HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LEIGHTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA
HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LAYTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHICA EH: -0.0
facebook/data2vec-audio-base-960h
====================Output 0====================
MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
MISTER QUILTER IS THE APOSTLE OF MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL: -0.938
====================Output 1====================
NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER
NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER RR: -0.6415
====================Output 2====================
HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND
HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE SIMILES DRAWN FROM EATING ITS RESULTS OCCUR MOST READILY TO MIND: 0.0
====================Output 3====================
HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LEIGHTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA
HE HAS GRAVE DOUBTS WHETHER SIR FREDERIC LEYHTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA: -0.0
As can be seen, this doesn’t seem to be very useful: incorrect text is predicted with very high confidence by the 10m
model, there is hardly any difference between the 960h and the 10m model, and there is no real separation between correctly and incorrectly predicted sentences either.
There are a couple of questions I’m not sure about:
- Is it even possible to do confidence scoring for ASR without a language model?
- Should the minimum (lowest prob) of all words be taken as the confidence of the transcription, or the average?
- Should the word prob correspond to a length-normalized log-sum or to an unnormalized one? (A small sketch comparing these variants follows below.)
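To make the last two questions concrete, the variants could be compared directly on top of the code above (just a sketch reusing pred_scores, output, and num_samples; word_logprob is only an illustrative helper name):
# per-word score: length-normalized vs. unnormalized sum of the frame log-probs
def word_logprob(word_dict, index, normalize=True):
    probs = pred_scores[index, word_dict["start_offset"]: word_dict["end_offset"]]
    total = torch.sum(probs).item()
    return total / len(probs) if normalize else total

# per-utterance score: minimum word prob vs. average word prob
for i in range(num_samples):
    word_scores = [word_logprob(d, i) for d in output.word_offsets[i]]
    print(f"sample {i}: min={min(word_scores):.4f} | avg={sum(word_scores) / len(word_scores):.4f}")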