ASR hypotheses rescoring with perplexity score

Hello everyone,

I’m trying to use a pretrained left-to-right language model (such as GPT) to (re)score several Automatic Speech Recognition (ASR) hypotheses. I tried to compute the perplexity of each ASR hypothesis by adapting the script presented here:

and to use this perplexity to assess which one among several ASR hypotheses is the best. Here is the modified version of the script:

# Compute likelihood score for ASR hypotheses. I reused some code coming from:
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
import torch

def main():
    toy_data = ["Do you accept these terms?",
                "Do you accept these terms.",
                "Do you accept these terms",
                "do you accept these terms",
                "Do you accept this terms?",
                "do you accept this terms",
                "Do you accept these germs?",
                "Do you accept these germs.",
                "do you accept these germs",
                "Do you accept this germs?",
                "do you accept this germs",
                "Do you accept these thermos?",
                "do you accept these thermos",
                "Do you accept these conditions?",
                "do you accept these conditions",
                "Do you accept this lulu?",
                "do you accept this lulu",
                "o u akcept dis termz"]

    model_name = "gpt2"

    device = 'cpu'
    model = GPT2LMHeadModel.from_pretrained(model_name).to(device)
    tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token

    encodings = tokenizer(toy_data, return_tensors='pt', padding=True, truncation=True)

    max_length = model.config.n_positions  # 1024 for gpt2
    stride = 1

    ppl = []
    for j in range(len(toy_data)):
        lls = []
        # I had to change the start from 0 to 1 because otherwise I was getting none for first output
        for i in range(1, encodings.input_ids.size(1), stride):
            begin_loc = max(i + stride - max_length, 0)
            end_loc = min(i + stride, encodings.input_ids.size(1))
            trg_len = end_loc - i  # may be different from stride on last loop
            input_ids = encodings.input_ids[j, begin_loc:end_loc].to(device)
            target_ids = input_ids.clone()
            target_ids[:-trg_len] = -100  # only the last trg_len tokens contribute to the loss

            with torch.no_grad():
                outputs = model(input_ids, labels=target_ids)
                log_likelihood = outputs[0] * trg_len

            lls.append(log_likelihood)

        ppl.append(torch.exp(torch.stack(lls).sum() / end_loc))

    for i, string in enumerate(toy_data):
        print(f"{string}: {ppl[i]}")

if __name__ == "__main__":
    main()

Here is the output for this script (note that these are test sentences and not real ASR hypotheses):

Do you accept these terms?: 27.66840171813965
Do you accept these terms.: 55.0008430480957
Do you accept these terms: 354.4270324707031
do you accept these terms: 484.396484375
Do you accept this terms?: 40.16838455200195
do you accept this terms: 730.451416015625
Do you accept these germs?: 55.72071075439453
Do you accept these germs.: 117.37071990966797
do you accept these germs: 332.7767028808594
Do you accept this germs?: 89.48542785644531
do you accept this germs: 383.7159423828125
Do you accept these thermos?: 99.58829498291016
do you accept these thermos: 380.8758239746094
Do you accept these conditions?: 41.42526626586914
do you accept these conditions: 654.9488525390625
Do you accept this lulu?: 78.20793914794922
do you accept this lulu: 303.59307861328125
o u akcept dis termz: 3207.991943359375
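
A possible caveat with these numbers: the batch is tokenized with `padding=True` and the pad token is set to EOS, so as far as I can tell the inner loop also scores the padded positions and divides by the padded length, which may distort the comparison between hypotheses of different lengths. Below is a minimal sketch, not the original script, of a per-hypothesis perplexity that masks out padding. It only needs `torch`; the function name `masked_perplexity` and its argument layout are my own assumptions.

```python
import torch
import torch.nn.functional as F


def masked_perplexity(logits, input_ids, attention_mask):
    """Per-sequence perplexity that ignores padded positions (illustrative helper).

    logits:         (batch, seq_len, vocab) float tensor from a causal LM
    input_ids:      (batch, seq_len) long tensor
    attention_mask: (batch, seq_len) tensor, 1 for real tokens, 0 for padding
    """
    # Position t predicts token t + 1: drop the last logit and the first label.
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]
    shift_mask = attention_mask[:, 1:].to(logits.dtype)

    # Token-level negative log-likelihood, shape (batch, seq_len - 1).
    nll = F.cross_entropy(
        shift_logits.transpose(1, 2), shift_labels, reduction="none"
    )

    # Average over real (non-padded) target tokens only, then exponentiate.
    n_tokens = shift_mask.sum(dim=1).clamp(min=1)
    mean_nll = (nll * shift_mask).sum(dim=1) / n_tokens
    return torch.exp(mean_nll)
```

With the objects from the script, and assuming a recent version of `transformers` where the model output exposes `.logits`, this could be called under `torch.no_grad()` as `masked_perplexity(model(**encodings).logits, encodings.input_ids, encodings.attention_mask)`. As in the original loop, the first token of each hypothesis is not scored because it has no preceding context.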

The problem is that all GPT and GPT2 models have been trained on cased data, most likely with punctuation, as explained here:

but the ASR system that I’m using doesn’t output punctuation or casing. As you can see with the results above, this has a huge impact on the perplexity: “do you accept these germs” becomes more likely than “do you accept these terms”.
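
To make this concrete, here is a small sketch that ranks the lowercase, unpunctuated hypotheses by the perplexities reported above. The numbers are copied from the output; the helper name `rank_hypotheses` is only for illustration.

```python
# Perplexities copied from the output above
# (lowercase, unpunctuated hypotheses only, i.e. what the ASR system would produce).
reported_ppl = {
    "do you accept these terms": 484.396484375,
    "do you accept this terms": 730.451416015625,
    "do you accept these germs": 332.7767028808594,
    "do you accept this germs": 383.7159423828125,
    "do you accept these thermos": 380.8758239746094,
    "do you accept these conditions": 654.9488525390625,
    "do you accept this lulu": 303.59307861328125,
    "o u akcept dis termz": 3207.991943359375,
}


def rank_hypotheses(ppl_by_hypothesis):
    """Return hypotheses sorted from lowest (best) to highest perplexity."""
    return sorted(ppl_by_hypothesis, key=ppl_by_hypothesis.get)


ranking = rank_hypotheses(reported_ppl)
for rank, hyp in enumerate(ranking, start=1):
    print(f"{rank}. {hyp}: {reported_ppl[hyp]:.2f}")
```

With these numbers, the intended transcript "do you accept these terms" only comes fifth, behind "this lulu", "these germs", "these thermos" and "this germs".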

So I have two questions:

  1. Do you know any other transformer models more appropriate for this task?
  2. As a broader question, do you know of any existing tools or implementations for this kind of rescoring?

Note: I am not sure that I picked the right forum category. If you think I made a mistake, don’t hesitate to tell me where it should go.
