Ideas to correct Wav2Vec2 transcription results

I’m tinkering with the preprocessing (segmentation) to get the best Wav2Vec2 transcriptions I can, and I’m fairly impressed with the results (compared to others like Silero, my previous experience with Sphinx, and alternatives like ELAN).

(I’m finding I probably need to reduce the maximum segment duration to below 60s, and haven’t quite got that dialled in yet, but that’s beside the point.)

However, there are some pretty glaring phonetic transcription mistakes, and I’m wondering whether there are standard approaches to correcting these automatically, before resorting to manual fixes.

For instance, I’m seeing “Boric Johnson” (for Boris Johnson, the UK Prime Minister), “ennay chess” (for NHS, the UK’s health service), and “social medeor” (social media).

These are all clearly phonetic ‘guesses’, and would be identifiable as out-of-vocabulary by a standard text model. I’m curious whether there’s a standard post-processing step that attempts to ‘realign’ such outputs with plausible alternatives (which I could then surface in an interface to resolve the ambiguous parts).
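To illustrate what I mean by ‘realigning’, here’s a minimal sketch using only the standard library’s `difflib`: fuzzy-matching out-of-vocabulary tokens against a list of known terms. The vocabulary here is a made-up toy list, not anything from a real model; in practice it would be a large gazetteer or corpus vocabulary.

```python
import difflib

# Toy vocabulary of known entities/terms; in practice this would be a
# large gazetteer or the vocabulary extracted from a text corpus.
VOCAB = ["Boris", "Johnson", "NHS", "social", "media"]

def suggest(word, cutoff=0.6):
    """Return the closest vocabulary entry for an out-of-vocabulary word,
    or the word itself if nothing is close enough."""
    if word in VOCAB:
        return word
    matches = difflib.get_close_matches(word, VOCAB, n=1, cutoff=cutoff)
    return matches[0] if matches else word

suggest("Boric")  # -> "Boris" (similarity ratio 0.8 against the toy vocab)
```

This obviously can’t handle cases like “ennay chess” → “NHS”, where the error spans multiple tokens, but it shows the shape of the vocabulary-realignment idea.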

I’m wondering if anyone knows the usual approach here (or, if this isn’t a standard step, could suggest one) using language models. Even just the proper terms for what I’m trying to do would help me research my next steps.

I’d have thought it’s a similar-ish problem to spelling correction (for which there’s ‘T5 Sentence Doctor’, for example, though my tests with it weren’t too encouraging). I may be missing a more appropriate alternative I don’t know about, so I thought I’d ask the community here.

Thanks! This is my first question on the forum, so please let me know if it’s off topic. I’ve used the new Wav2Vec2 960h model and may try the other versions next, but I expect this will apply to all of them.



I think you need a language model trained on a large corpus. You can have a look at this issue.
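To make that concrete: the usual terms for this are “LM rescoring” or “shallow fusion” — you score candidate transcriptions with an n-gram language model and keep the most probable one. Below is a toy add-one-smoothed bigram sketch (the “corpus” and candidates are invented for illustration; a real setup would use something like KenLM):

```python
import math
from collections import Counter

# Tiny "corpus" standing in for a large text collection.
corpus = "boris johnson visited the nhs and spoke about social media".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
V = len(unigrams)  # vocabulary size, for add-one smoothing

def log_prob(sentence):
    """Add-one-smoothed bigram log-probability of a word sequence."""
    words = sentence.lower().split()
    score = 0.0
    for prev, cur in zip(words, words[1:]):
        score += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + V))
    return score

# Rescore two candidate transcriptions; the LM prefers the one it has seen.
candidates = ["boric johnson", "boris johnson"]
best = max(candidates, key=log_prob)  # "boris johnson"
```

The candidates would come from the acoustic model (e.g. a beam of CTC hypotheses) rather than being hand-written as here; the LM then acts as the tiebreaker that pulls phonetic guesses back toward real vocabulary.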