Confidence Scores / Self-Training for Wav2Vec2 / CTC models With LM (PyCTCDecode)

I started looking a bit into confidence scores / self-training for speech recognition with models like Wav2Vec2 that make use of a language model via the pyctcdecode library.

PyCTCDecode returns an lm_score, which can be seen as the fused score between the acoustic model (Wav2Vec2) and a language model (KenLM). This score is the sum of all per-word fused lm_scores, so it seems reasonable to normalize the output by the number of words. Also see some questions here:

First, let’s create some Wav2Vec2 + ngram models. We’ll simply add the official 4-gram of Librispeech to the new data2vec models to create the following models:

  • patrickvonplaten/data2vec-audio-base-10m-4-gram
  • patrickvonplaten/data2vec-audio-base-100h-4-gram
  • patrickvonplaten/data2vec-audio-base-960h-4-gram

Now, it’s quite easy to retrieve those lm_scores and to compute a confidence level this way:

  1. Import all necessary libraries and load the model and processor
from transformers import AutoModelForCTC, AutoProcessor
from datasets import load_dataset
import datasets
import torch

# set this to one of the model ids listed above
model_id = "TODO: fill in"

model = AutoModelForCTC.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
  2. Load Librispeech dummy data:
num_samples = 4

dataset = load_dataset("patrickvonplaten/librispeech_asr_dummy", "clean", split="validation")
samples = dataset[:num_samples]
audio_samples = [s["array"] for s in samples["audio"]]
# all samples share the same sampling rate, so take the single value from the set
sampling_rate = set([s["sampling_rate"] for s in samples["audio"]]).pop()
text_samples = samples["text"]
  3. Predict transcription with model:
# process to input_values
inputs = processor(audio_samples, return_tensors="pt", sampling_rate=sampling_rate, padding=True)

# forward inputs to model
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
  4. Retrieve the lm_score of each prediction and normalize it by the number of words:
output = processor.batch_decode(logits.numpy(), output_word_offsets=True)
confidence_scores = [score / len(t.split(" ")) for score, t in zip(output.lm_score, output.text)]
  5. Take the length-normalized lm_score as the confidence score of the prediction and print it next to the reference:
for i in range(num_samples):
    print(20 * "=" + f"Output {i}" + 20 * "=")
    print(text_samples[i])
    print(f"{output.text[i]}: {confidence_scores[i]}")
    print("\n")
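
The length normalization above can also be wrapped in a small helper. The following is only a sketch: the function name `word_normalized_score` and the example values are made up for illustration and are not part of pyctcdecode or transformers. Unlike the one-liner above, it splits on any whitespace and returns negative infinity for an empty transcription, which is my own assumption about how empty predictions should be treated.

```python
def word_normalized_score(lm_score: float, text: str) -> float:
    """Divide the summed lm_score by the number of words in the decoded text."""
    num_words = len(text.split())
    if num_words == 0:
        # Assumption: an empty prediction gets the worst possible confidence.
        return float("-inf")
    return lm_score / num_words


# Hypothetical values, not taken from a real model run.
example_text = "HELLO WORLD AGAIN"
example_lm_score = -6.0
print(word_normalized_score(example_lm_score, example_text))  # -2.0
```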

Cool, let’s run this on the new data2vec audio models:

  1. patrickvonplaten/data2vec-audio-base-10m-4-gram
====================Output 0====================
MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
MISTER QUILTER IS THE APPOSELE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL: -2.9550299660242825


====================Output 1====================
NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER
NOR IS MISTER QUILTR'S MANNER LESS INTERESTING THAN HIS MATTER: -3.8471058156146243


====================Output 2====================
HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND
HE TELLS IS THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CRISMIIS AND ROST BEEF LOOMING BEFORE HIS SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND: -3.115683062281252


====================Output 3====================
HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LEIGHTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA
HE HAS GRAVED DOUBTS WHETHER SIR FREDERICK LATEN'S WORK IS RELY GREEK AFTER ALL AND CAN DESCOVER IN IT BUT LITTLE OF ROCKY ETHICA: -4.292775884726897

  2. patrickvonplaten/data2vec-audio-base-100h-4-gram
====================Output 0====================
MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL: -1.0723093529710663


====================Output 1====================
NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER
NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER: -2.6140757339617786


====================Output 2====================
HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND
HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND: -1.1805021799946347


====================Output 3====================
HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LEIGHTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA
HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LAYTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA EH: -2.069009737832042

  3. patrickvonplaten/data2vec-audio-base-960h-4-gram
====================Output 0====================
MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL: -1.0610139720694658


====================Output 1====================
NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER
NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER R: -3.11299682252419


====================Output 2====================
HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND
HE TELLS US THAT AT THIS FESTIVE SEASON OF THE YEAR WITH CHRISTMAS AND ROAST BEEF LOOMING BEFORE US SIMILES DRAWN FROM EATING AND ITS RESULTS OCCUR MOST READILY TO THE MIND: -1.147767963941466


====================Output 3====================
HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LEIGHTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA
HE HAS GRAVE DOUBTS WHETHER SIR FREDERICK LEIGHTON'S WORK IS REALLY GREEK AFTER ALL AND CAN DISCOVER IN IT BUT LITTLE OF ROCKY ITHACA: -1.870571726475313

Alright, this actually seems to make some sense! The 10m model consistently has the lowest score, and in general the more correct the transcription, the better the score. The 960h model has the best score for every sample except Output 1, where the 100h model also gives the better prediction (the 960h prediction ends with a spurious "R").
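
To double-check the comparison, here is a small sketch that picks the best- and worst-scoring model for each output. The scores are copied from the Librispeech results above and rounded to four decimals; the short labels (`10m`, `100h`, `960h`) are just shorthand for the three checkpoints.

```python
# Length-normalized lm_scores per model, one entry per Output 0..3 (rounded).
scores = {
    "10m": [-2.9550, -3.8471, -3.1157, -4.2928],
    "100h": [-1.0723, -2.6141, -1.1805, -2.0690],
    "960h": [-1.0610, -3.1130, -1.1478, -1.8706],
}
num_outputs = 4

# A higher (less negative) score means higher confidence.
best_per_output = [max(scores, key=lambda m: scores[m][i]) for i in range(num_outputs)]
worst_per_output = [min(scores, key=lambda m: scores[m][i]) for i in range(num_outputs)]

print(best_per_output)   # ['960h', '100h', '960h', '960h']
print(worst_per_output)  # ['10m', '10m', '10m', '10m']
```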

This already seems to work quite well, but would need some more experiments.
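
One such experiment would be to check how well the confidence score tracks the word error rate (WER). Below is a minimal sketch of a word-level WER in plain Python, assuming whitespace tokenization. The function name `word_error_rate` is hypothetical, and in practice a tested library would be the safer choice. It is applied here to Output 1 of the Librispeech samples above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous_row[j]: edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous_row = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref)


reference = "NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER"
prediction_100h = "NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER"
prediction_960h = "NOR IS MISTER QUILTER'S MANNER LESS INTERESTING THAN HIS MATTER R"

wer_100h = word_error_rate(reference, prediction_100h)
wer_960h = word_error_rate(reference, prediction_960h)
print(wer_100h)  # 0.0
print(wer_960h)  # 0.1
```

For this sample the model with the lower WER (100h) is also the one with the better confidence score (-2.61 vs. -3.11).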

There are a couple of questions I’m not sure about:

I also tried it out on an “out-of-distribution” dataset, the English version of Common Voice, and it still seems to work quite well.

To do so, replace the second step above (“Load Librispeech dummy data”) with the following code, which loads Common Voice data:

dataset = load_dataset("common_voice", "en", split="test", streaming=True)
dataset = dataset.cast_column("audio", datasets.Audio(sampling_rate=16_000))

# iterate over dataset
dataset_iter = iter(dataset)
samples = [next(dataset_iter) for _ in range(num_samples)]

audio_samples = [s["audio"]["array"] for s in samples]
sampling_rate = set([s["audio"]["sampling_rate"] for s in samples]).pop()
text_samples = [s["sentence"] for s in samples]

And then running the script again gives the following results:

  1. patrickvonplaten/data2vec-audio-base-10m-4-gram
====================Output 0====================
It was the time of day when all of Spain slept during the summer.
IT WAS THE TIME OF DAY LEVERS BEN SLEPT DURING THE SUMMER: -3.5796514559110606


====================Output 1====================
Same way you did.
THE SAME POINT: -6.560971691113143


====================Output 2====================
Sarah told him that she was there to see her brother.
BUT I TOLD HIM THAT SHE WAS IN TO SEE HER BROTHER: -1.249188184327079


====================Output 3====================
Galileo Galilei was the first man who observed the planet Neptune through his telescope.
CALLILI GALLI WAS A FRESHMAN WHO ABSORVES TO PLANT NAPS THOUGH HIS TELICSCOP: -7.170448685148719

  2. patrickvonplaten/data2vec-audio-base-100h-4-gram
====================Output 0====================
It was the time of day when all of Spain slept during the summer.
IT WAS THE TIME OF DAY WHEN OLIVE'S PEN SLEPT DURING THE SUMMER: -1.724733290751429


====================Output 1====================
Same way you did.
THE SAME DIN YOU TIED: -11.673662061158192


====================Output 2====================
Sarah told him that she was there to see her brother.
THERE I TOLD HIM THAT SHE WAS HERE TO SEE HER BROTHER: -1.3407323223953858


====================Output 3====================
Galileo Galilei was the first man who observed the planet Neptune through his telescope.
GALILEO GALILEI WAS A FRESHMAN WHO OBSERVES THE PLANT NUPKINS THROUGH HIS TELECSCOPE: -5.179441703647934

  3. patrickvonplaten/data2vec-audio-base-960h-4-gram
====================Output 0====================
It was the time of day when all of Spain slept during the summer.
IT WAS THE TIME OF DAY WHEN OLIVER BEN SLEPT DURING THE SUMMER: -1.4758548315739513


====================Output 1====================
Same way you did.
THE BLIND YOU IN IT: -8.845217131011449


====================Output 2====================
Sarah told him that she was there to see her brother.
BUT I TOLD HIM THAT SHE WAS HERE TO SEE HER BROTHER: -1.3983698052694178


====================Output 3====================
Galileo Galilei was the first man who observed the planet Neptune through his telescope.
GALILEO GALIDI WAS THE FIRST MAN WHO OBSERVES TO PLAN NAPTHA THROUGH HIS TELECOSCOPE: -4.983984955432581

So the numbers here still seem very reasonable. Every prediction with a score below -3 is indeed quite wrong, and things start to look better for scores above -2.
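
For self-training, a cutoff like this could be used to decide which predictions to keep as pseudo-labels. The snippet below is only a sketch under that assumption: the helper name `select_pseudo_labels` and the default threshold of -2 are hypothetical choices based on the four samples above, not a validated recipe. The texts and (rounded) scores are the 960h predictions on Common Voice from above.

```python
from typing import List, Tuple


def select_pseudo_labels(
    texts: List[str], confidences: List[float], threshold: float = -2.0
) -> List[Tuple[str, float]]:
    """Keep only predictions whose length-normalized score is above the threshold."""
    return [(t, c) for t, c in zip(texts, confidences) if c > threshold]


texts = [
    "IT WAS THE TIME OF DAY WHEN OLIVER BEN SLEPT DURING THE SUMMER",
    "THE BLIND YOU IN IT",
    "BUT I TOLD HIM THAT SHE WAS HERE TO SEE HER BROTHER",
    "GALILEO GALIDI WAS THE FIRST MAN WHO OBSERVES TO PLAN NAPTHA THROUGH HIS TELECOSCOPE",
]
confidences = [-1.4759, -8.8452, -1.3984, -4.9840]

kept = select_pseudo_labels(texts, confidences)
for text, confidence in kept:
    print(f"{confidence:.4f}  {text}")
```

With these numbers only Output 0 and Output 2 are kept. Both still contain errors compared to the references (e.g. "OLIVER BEN" instead of "all of Spain"), so the threshold would likely need to be tuned on more data.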