Model Suggestion for Text Correction

I’m fairly new to NLP but come from a heavy programming background, and I’m looking for recommendations on a good starting point for the problem I’m trying to solve. I have a model/pipeline doing audio-to-text transcription for audio clips that average 30 seconds to 1 minute in length. It does a good job transcribing simple audio, but falters on specialized nomenclature, such as engineering terms or slang, replacing those words with similar-sounding words rather than the exact ones.

One benefit is that I have access to tens of thousands of audio files with manually transcribed texts, containing this slang and specialized nomenclature. My idea is to run the first model over the already-transcribed audio, then feed its output through some sort of transformation model that “fixes” each sentence to the best of its ability, guessing what was most likely meant based on past history. Keep in mind that the manually transcribed text repeatedly has the same context, specialized nomenclature, slang, proper nouns, etc.


For example, the first model currently outputs:

How is it going with your toaster models? When will you be archiving to the convention?

when it should say:

How is it going with your Schitzer models? When will you be arriving to the convention?
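
To make the idea concrete, here is a minimal sketch of how the first model's output could be compared against the manual transcripts to see which words it gets wrong and how often. It is not a correction model, only a way to build and inspect the (transcribed, correct) pairs such a model would train on. It assumes the two texts are already available as plain strings; the function names `mine_substitutions` and `count_substitutions` are made up for illustration, and the word alignment uses Python's standard `difflib`, which is a simple stand-in for a proper word-error-rate alignment.

```python
from collections import Counter
from difflib import SequenceMatcher


def mine_substitutions(hypothesis, reference):
    """Return (asr_phrase, correct_phrase) pairs where the two texts differ.

    hypothesis: text produced by the audio-to-text model
    reference:  manually transcribed text for the same audio
    """
    hyp_words = hypothesis.split()
    ref_words = reference.split()
    # Align the two word sequences; autojunk is disabled so frequent
    # words are never silently ignored by the matcher.
    matcher = SequenceMatcher(a=hyp_words, b=ref_words, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "replace" marks a span where the model wrote something else;
        # pure insertions and deletions are skipped in this sketch.
        if tag == "replace":
            pairs.append((" ".join(hyp_words[i1:i2]), " ".join(ref_words[j1:j2])))
    return pairs


def count_substitutions(text_pairs):
    """Count how often each substitution occurs across (hypothesis, reference) pairs."""
    counts = Counter()
    for hypothesis, reference in text_pairs:
        counts.update(mine_substitutions(hypothesis, reference))
    return counts


if __name__ == "__main__":
    hyp = "How is it going with your toaster models? When will you be archiving to the convention?"
    ref = "How is it going with your Schitzer models? When will you be arriving to the convention?"
    for wrong, right in mine_substitutions(hyp, ref):
        print(f"{wrong!r} -> {right!r}")
```

Run over the full set of transcripts, the counts would show whether the errors are concentrated in a small set of recurring terms, which may help decide between a simple vocabulary-based fix and a trained sequence-to-sequence correction model.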

Can anyone recommend a model or approach I can research to accomplish this?