I have some data consisting of audio and human-annotated summaries of that audio; I do not, however, have ground-truth transcripts. During the annotation process I used Whisper to transcribe the audio, and the annotators had access to both the transcripts and the audio when writing their summaries. I've trained a summarization model (BART) from transcript to summary, but of course mistranscription errors cascade into the summarization model. Hence I've been thinking about an end-to-end approach to my "audio summarization" problem.
- Stitching together Whisper and BART and training end to end

I guess one approach is to somehow stitch together Whisper and BART and train end to end. I'm not entirely sure how to achieve this, because Whisper's decoding procedure is presumably not differentiable:
Whisper(Audio) → Whisper Decoding → Transcript → BART → Predicted Summary → Loss(Predicted Summary, True Summary)
So I guess the loss won't be able to flow all the way back to the Whisper model? I'm not sure whether there are any tricks to get something like this to work; any tips/tricks would be appreciated if so.
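One standard trick for pushing gradients through a discrete decoding step is the straight-through Gumbel-softmax: the forward pass uses hard one-hot tokens, while the backward pass uses the soft distribution. Below is a toy sketch of the idea, not a working Whisper+BART pipeline; the names and sizes (`whisper_logits`, `bart_embeddings`, `vocab_size`, etc.) are stand-ins I made up. In the real setup the logits would come from Whisper's decoder at each step and the embedding matrix would be BART's input-embedding table, with the hard tokens "embedded" via a matmul rather than an index lookup so the whole path stays differentiable.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embed_dim, seq_len = 50, 16, 4

# Stand-in for Whisper's per-step decoder logits (in reality these are
# produced by the model, not a free parameter).
whisper_logits = torch.randn(seq_len, vocab_size, requires_grad=True)

# Stand-in for BART's input token-embedding table.
bart_embeddings = torch.nn.Embedding(vocab_size, embed_dim)

# Straight-through Gumbel-softmax: forward pass yields hard one-hot tokens,
# backward pass uses the soft relaxation, so gradients reach the logits.
one_hot = F.gumbel_softmax(whisper_logits, tau=1.0, hard=True)

# Embed the "decoded" tokens by matmul instead of an index lookup,
# keeping the operation differentiable end to end.
soft_embeds = one_hot @ bart_embeddings.weight  # (seq_len, embed_dim)

# Any downstream summarization loss (a dummy sum here) now backpropagates
# through the decoding step into the upstream logits.
loss = soft_embeds.sum()
loss.backward()
print(whisper_logits.grad is not None)  # True: gradient flowed through
```

Caveats worth noting: this sidesteps Whisper's actual beam-search/temperature-fallback decoding (you'd decode greedily step by step), and the Whisper and BART tokenizers differ, so you'd need a shared vocabulary or a projection between them.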
- Teaching Whisper a new task of summarization
Whisper is currently aware of two tasks, "translate" and "transcribe". I was thinking: why not try to teach it a new task, "summarize"? As far as I understand, Whisper's decoder has been pretrained with the <|translate|> and <|transcribe|> task tokens, and I would simply fine-tune with a new <|summarize|> token. As long as I have a sufficient amount of training data, I think this could work.
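Mechanically, adding a <|summarize|> token comes down to growing the decoder's token-embedding table by one row while keeping the pretrained rows intact, then fine-tuning with the new token in the prefix. With Hugging Face transformers this would be `tokenizer.add_special_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`; here is a torch-only toy sketch of the resize itself (the vocabulary sizes are made up), avoiding a model download:

```python
import torch

torch.manual_seed(0)

old_vocab, embed_dim = 100, 8

# Stand-in for the pretrained decoder embedding table.
old_embed = torch.nn.Embedding(old_vocab, embed_dim)

# New table with one extra slot for the <|summarize|> task token.
new_embed = torch.nn.Embedding(old_vocab + 1, embed_dim)
with torch.no_grad():
    # Copy the pretrained rows; the new row stays randomly initialised
    # and gets learned during fine-tuning.
    new_embed.weight[:old_vocab] = old_embed.weight

summarize_id = old_vocab  # id of the new <|summarize|> token
print(new_embed(torch.tensor([summarize_id])).shape)  # torch.Size([1, 8])
```

During fine-tuning you would then build decoder prefixes with the new task token in place of <|transcribe|> and train on (audio, summary) pairs directly.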
I'd prefer to implement approach (1) because it's more interpretable: both a transcript and a summary are produced, whereas in (2) only the summary is produced.
I'm curious to hear whether these approaches are viable, and any tips/tricks for making (1) work if possible.