I wanted to transcribe dictation where the speaker says words like “paragraph”, “comma”, and “period”. The original models did so-so with that, more often than not mangling the dictation commands.
I created a dataset of about 1,000–2,000 examples (I took recordings and split them with Silero VAD into chunks of 30 s or less) and labeled them manually using Prodigy. Then I used the dataset to fine-tune the small.en model and got exceptional results.
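For anyone trying to reproduce the chunking step: a minimal sketch of how VAD output can be packed into ≤30 s chunks, assuming Silero VAD has already given you `(start, end)` speech timestamps in seconds (the function name and greedy strategy here are illustrative, not exactly what I ran):

```python
def pack_chunks(segments, max_len=30.0):
    """Greedily group consecutive VAD speech segments into chunks whose
    total span (first start to last end) stays within max_len seconds.
    A single segment longer than max_len still becomes its own chunk."""
    chunks, current = [], []
    for start, end in segments:
        # Flush the current chunk if adding this segment would exceed max_len.
        if current and end - current[0][0] > max_len:
            chunks.append((current[0][0], current[-1][1]))
            current = []
        current.append((start, end))
    if current:
        chunks.append((current[0][0], current[-1][1]))
    return chunks

# Example: four speech segments from VAD, packed into two chunks.
print(pack_chunks([(0.0, 5.0), (6.0, 12.0), (28.0, 35.0), (40.0, 45.0)]))
# → [(0.0, 12.0), (28.0, 45.0)]
```

Keeping chunks under 30 s matters because Whisper processes audio in 30-second windows, so each training example fits in one window.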
This is similar to your case: words the model was not originally particularly good at transcribing.