I started looking a bit into Confidence Scores / Self-Training for Speech Recognition for models like Wav2Vec2 that make use of a language model via the pyctcdecode library.
pyctcdecode returns an `lm_score`, which can be seen as the fused score of the acoustic model (Wav2Vec2) and the language model (kenLM). This score is the sum of all per-word fused lm_scores, so it seems reasonable to normalize the output by the number of words. Also see some questions here:
- confidence scores output from the LM · Issue #57 · kensho-technologies/pyctcdecode · GitHub
- Question about naming of `lm_score` parameter in `decode_logits` · Issue #63 · kensho-technologies/pyctcdecode · GitHub
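For illustration, here is a minimal sketch of that length normalization at the pyctcdecode level. The decoder construction and the layout of the returned beam tuple follow my reading of the pyctcdecode README; `vocab`, `logits` and the kenLM path are placeholders:

```python
from pyctcdecode import build_ctcdecoder

# placeholders: `vocab` is the acoustic model's output vocabulary (in logit order),
# "4gram.arpa" a kenLM model, `logits` a (time, vocab_size) numpy array of CTC logits
decoder = build_ctcdecoder(labels=vocab, kenlm_model_path="4gram.arpa")

# best beam: (text, lm_state, word_frames, logit_score, lm_score)
text, _, _, logit_score, lm_score = decoder.decode_beams(logits)[0]

# length-normalize the fused score by the number of predicted words
confidence = lm_score / len(text.split())
```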
First, let’s create some Wav2Vec2 + n-gram models. We’ll simply add the official 4-gram of Librispeech to the new data2vec models to create the following models (see the sketch after this list for roughly how):
- patrickvonplaten/data2vec-audio-base-10m-4-gram
- patrickvonplaten/data2vec-audio-base-100h-4-gram
- patrickvonplaten/data2vec-audio-base-960h-4-gram
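For completeness, this is roughly how such a checkpoint can be assembled (a sketch of the usual n-gram recipe; the base checkpoint name and the local ARPA path are assumptions on my side):

```python
from pyctcdecode import build_ctcdecoder
from transformers import AutoProcessor, Wav2Vec2ProcessorWithLM

# plain (LM-less) processor of the base checkpoint
processor = AutoProcessor.from_pretrained("facebook/data2vec-audio-base-960h")

# build a pyctcdecode decoder from the tokenizer vocab and the official Librispeech 4-gram
vocab_dict = processor.tokenizer.get_vocab()
sorted_vocab = [tok for tok, _ in sorted(vocab_dict.items(), key=lambda item: item[1])]
decoder = build_ctcdecoder(labels=sorted_vocab, kenlm_model_path="4-gram.arpa")  # placeholder path

# wrap feature extractor, tokenizer and decoder into a processor with LM and save it
processor_with_lm = Wav2Vec2ProcessorWithLM(
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    decoder=decoder,
)
processor_with_lm.save_pretrained("./data2vec-audio-base-960h-4-gram")
```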
Now, it’s quite easy to retrieve those `lm_scores` and to compute a confidence level this way:
- Import all necessary libraries and load the model and processor:
from transformers import AutoModelForCTC, AutoProcessor
from datasets import load_dataset
import datasets
import torch

# one of the checkpoints listed above, e.g. "patrickvonplaten/data2vec-audio-base-10m-4-gram"
model_id = "TODO: fill in"

model = AutoModelForCTC.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
- Load Librispeech dummy data:
num_samples = 4

dataset = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
samples = dataset[:num_samples]

audio_samples = [s["array"] for s in samples["audio"]]
# all dummy samples share the same sampling rate (16 kHz)
sampling_rate = set(s["sampling_rate"] for s in samples["audio"]).pop()
text_samples = samples["text"]
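The dummy data is already at 16 kHz; for other audio one could resample on the fly with `datasets` before slicing, e.g.:

```python
# only needed if the audio does not match the model's expected 16 kHz sampling rate
dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))
```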
- Predict the transcriptions with the model:
# process raw audio to input_values
inputs = processor(audio_samples, return_tensors="pt", sampling_rate=sampling_rate, padding=True)

# forward inputs through the model
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
- Retrieve the fused `lm_score` normalized by the number of predicted words:
output = processor.batch_decode(logits.numpy(), output_word_offsets=True)
# length-normalize the per-utterance lm_score by the number of predicted words
confidence_scores = [score / len(t.split(" ")) for score, t in zip(output.lm_score, output.text)]
- Define the confidence score as the length-normalized `lm_score` of the prediction and print it next to the reference:
for i in range(num_samples):
    print(20 * "=" + f"Output {i}" + 20 * "=")
    print(text_samples[i])
    print(f"{output.text[i]}: {confidence_scores[i]}")
    print("\n")
Cool, let’s run this on the new data2vec audio models:
- patrickvonplaten/data2vec-audio-base-10m-4-gram
- patrickvonplaten/data2vec-audio-base-100h-4-gram
- patrickvonplaten/data2vec-audio-base-960h-4-gram
patrickvonplaten/data2vec-audio-base-10m-4-gram

====================Output 0====================
MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
MISTER QUILTER IS THE APPOSELE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL: -2.9550299660242825
====================Output 1====================
NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER
NOR IS MISTER QUILTR'S MANNER LESS INTERESTING THAN HIS MATTER: -3.8471058156146243
====================Output 2====================
HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND
HE TELLS IS THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CRISMIIS AND ROST BEEF LOOMING BEFORE HIS SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND: -3.115683062281252
====================Output 3====================
HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LEIGHTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA
HE HAS GRAVED DOUBTS WHETHER SIR FREDERICK LATEN'S WORK IS RELY GREEK AFTER ALL AND CAN DESCOVER IN IT BUT LITTLE OF ROCKY ETHICA: -4.292775884726897
patrickvonplaten/data2vec-audio-base-100h-4-gram

====================Output 0====================
MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL: -1.0723093529710663
====================Output 1====================
NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER
NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER: -2.6140757339617786
====================Output 2====================
HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND
HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND: -1.1805021799946347
====================Output 3====================
HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LEIGHTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA
HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LAYTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA EH: -2.069009737832042
patrickvonplaten/data2vec-audio-base-960h-4-gram

====================Output 0====================
MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL: -1.0610139720694658
====================Output 1====================
NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER
NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER R: -3.11299682252419
====================Output 2====================
HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND
HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND: -1.147767963941466
====================Output 3====================
HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LEIGHTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA
HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LEIGHTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA: -1.870571726475313
Alright, this actually seems to make some sense! The 10m model consistently has the lowest scores, and one can generally say that the more correct the sentence, the better the score. The 960h model has the best scores for all but Output 1, for which the 100h model also gives a better prediction.
This already seems to work quite well, but it would need some more experiments.
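If it does hold up, the natural next step for self-training would be to keep only pseudo-labels above some confidence threshold. A minimal sketch on top of the variables from above (the threshold value is made up):

```python
# keep only utterances whose length-normalized lm_score exceeds a (to-be-tuned) threshold
confidence_threshold = -1.5  # placeholder value

pseudo_labeled = [
    {"audio": audio, "text": transcription}
    for audio, transcription, score in zip(audio_samples, output.text, confidence_scores)
    if score > confidence_threshold
]
# pseudo_labeled could then be mixed with the labeled data for another fine-tuning round
```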
There are a couple of questions I’m not sure about:
- Right now the average score per word is taken; would `min` or `max` over the per-word scores maybe be better (see the rough sketch below)? Also see: confidence scores output from the LM · Issue #57 · kensho-technologies/pyctcdecode · GitHub
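As a rough sketch of a per-word alternative: `batch_decode` does not directly expose the per-word fused lm_scores, so the snippet below falls back to a purely acoustic proxy (frame-level max softmax probability averaged over each word's offsets) and then takes the `min` over words; treating `end_offset` as exclusive is an assumption on my side.

```python
# per-word acoustic proxy confidence, aggregated with min instead of mean
probs = logits.softmax(dim=-1).max(dim=-1).values  # (batch, time): max probability per frame

min_word_confidences = []
for i, word_offsets in enumerate(output.word_offsets):
    word_scores = []
    for w in word_offsets:
        # treat end_offset as exclusive; fall back to a single frame for degenerate spans
        span = probs[i, w["start_offset"] : w["end_offset"]]
        word_scores.append(span.mean().item() if span.numel() > 0 else probs[i, w["start_offset"]].item())
    min_word_confidences.append(min(word_scores) if word_scores else 0.0)
```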