Hi there, I am currently using OpenAI’s Whisper model to perform speech-to-text transcription. Much of my input audio uses language and terms specific to a particular industry/domain: biology. When the audio contains biology terms, the model does not always transcribe them correctly, and sometimes skips over them entirely. The obvious solution seems to be fine-tuning Whisper on my own dataset (Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers), or potentially training a new tokenizer (though then I would lose Whisper’s pretrained weights). But I do not have access to a large training corpus containing both the necessary biological terms and their audio files.
So instead, I was thinking of exploring a domain-specific attention initialization approach, where I would initialize the attention with prior information about biological terminology. Since the source of my problem seems to lie in the decoder part of Whisper’s transformer, I was thinking of modifying the attention part of its architecture. Please correct me if I am wrong or if there is a better way. But if this is the right direction for this problem, I am wondering how I would actually implement it.
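To make the idea concrete, here is a toy NumPy sketch of what I have in mind — this is not Whisper’s actual code, and the function name and `prior_bias` argument are just my own illustration: scaled dot-product attention where an additive bias on the attention scores encodes a prior, e.g. boosting positions that correspond to known domain terms.

```python
import numpy as np

def attention_with_prior(q, k, v, prior_bias):
    """Scaled dot-product attention with an additive prior on the scores.

    q: (n_q, d) queries, k: (n_k, d) keys, v: (n_k, d_v) values,
    prior_bias: (n_q, n_k) additive bias, e.g. large positive values
    at key positions belonging to known biology terms.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores + prior_bias  # inject the domain prior here
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

# with zero bias the attention is unchanged; with a strong positive
# bias at one position, the output is pulled toward that value
q = np.ones((1, 2))
k = np.ones((3, 2))
v = np.arange(3, dtype=float).reshape(3, 1)
print(attention_with_prior(q, k, v, np.zeros((1, 3))))
print(attention_with_prior(q, k, v, np.array([[0.0, 0.0, 10.0]])))
```

Of course, in the real model this bias would have to be injected into (or used to initialize) Whisper’s decoder attention layers rather than a standalone function, and I suspect it would still need at least some fine-tuning data to work well — which is exactly the part I am unsure about.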
@sanchit-gandhi do you maybe have input on this?