Whisper is awesome at producing written transcripts of speech. But speech itself is hard to read: it tends to be excessively verbose and to contain false starts or changes of direction mid-sentence.
What’s the state of the art with regard to editing spoken speech from transcripts into something easier to read?
So far I’ve found that paraphrasing tools tend to take too many risks with the meaning, or to rewrite more than is really necessary.
GPT-3 will make confident changes that read nicely but are just wrong. And it’s quite hard to get GPT-3’s edit endpoint to make any change at all.
So I’m looking for a relatively conservative technology that makes small changes where there’s a definite benefit to doing so.
It seems like you would want to use a sequence of models: first an ASR model like Whisper, then a summarization model. (You can do it in a single model, but you’d have to train it from scratch haha)
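The chaining itself is just composing two callables. A minimal sketch, assuming Hugging Face pipeline-style return shapes (the stand-in callables and the model names in the comments are illustrative, not a fixed recommendation):

```python
def transcribe_then_summarize(asr, summarizer, audio):
    """Chain two models: an ASR model first, then a seq2seq summarizer.

    `asr` and `summarizer` are any callables with Hugging Face pipeline
    return shapes, e.g.:
        pipeline("automatic-speech-recognition", model="openai/whisper-small")
        pipeline("summarization")
    (model choice is illustrative).
    """
    transcript = asr(audio)["text"]          # ASR pipelines return {"text": ...}
    return summarizer(transcript)[0]["summary_text"]  # summarizers return [{"summary_text": ...}]

# Stand-in callables with the same return shapes, so the chain can be
# demonstrated without downloading any models:
def fake_asr(audio):
    return {"text": "um, so, we should, we should ship on Friday"}

def fake_summarizer(text):
    return [{"summary_text": "We should ship on Friday."}]

print(transcribe_then_summarize(fake_asr, fake_summarizer, b"raw audio bytes"))
# → We should ship on Friday.
```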
As you’ve written above, the summarization model may hallucinate. Luckily, you have full control over text generation with HF tools, and we have something that curbs these hallucinations: a repetition penalty/bonus for tokens that are present in the input.
The constant you define there will be multiplied by the token logits and, since the logits are negative values with 0.0 being the maximum “probability”, you want to set it above 1.0 to promote the use of tokens that are present in the input of the summarization task.
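If it helps to see the mechanics, here is a simplified, pure-Python sketch of that rescaling (in transformers this lives in EncoderRepetitionPenaltyLogitsProcessor; the toy logits below are made up for illustration):

```python
def apply_encoder_repetition_penalty(logits, input_token_ids, penalty):
    """Rescale the logits of tokens that appear in the encoder input.

    Mirrors the logic described above: internally the multiplier is
    1 / penalty, so a penalty above 1.0 makes the negative logits of
    input tokens *less* negative, i.e. it promotes them.
    """
    multiplier = 1 / penalty
    adjusted = dict(logits)
    for tok in input_token_ids:
        score = adjusted[tok]
        # Negative logits shrink in magnitude; positive ones grow.
        adjusted[tok] = score * multiplier if score < 0 else score / multiplier
    return adjusted

# Toy vocabulary of three token ids with made-up logits.
logits = {0: -4.0, 1: -1.0, 2: -0.5}
boosted = apply_encoder_repetition_penalty(logits, input_token_ids=[0], penalty=2.0)
print(boosted[0])  # -2.0: token 0 (present in the input) is now more likely
```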
I hope this helps
Thanks @joaogante that is incredibly helpful. You’re right,
encoder_repetition_penalty is going to be exactly what I need to create the kind of faithful editing I have in mind. That really makes me much more optimistic this is achievable.
I’ve had a look at the other settings there, but it’s not clear that there is one that can help with a separate aspect of the problem: how the model treats words it does not understand.
My text will likely contain technical terms that are important, and there will be too many of them for me to specify. I want the model to faithfully preserve words it does not understand, rather than omit them. Is there a setting that might influence whether it preserves or drops unfamiliar token sequences?
Hey @joaogante in the link you provided to the raw code I see:
self.penalty = 1 / penalty. Wouldn’t this mean that a higher penalty would promote the use of tokens that are present at the input?
I’ve also run some tests, and it seems that a higher penalty gives outputs that are more grounded in the input text. It’s a small sample size, though, so this observation could mean nothing.
It’s very hard to find much about this parameter online so any thoughts & feedback are appreciated.
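Your reading matches the code: because the stored multiplier is 1 / penalty, a larger encoder_repetition_penalty boosts input tokens more strongly. A quick check of that monotonicity with a made-up logit (hypothetical values, not from any real model):

```python
def boost(score, penalty):
    # Same rescaling as EncoderRepetitionPenaltyLogitsProcessor applies to
    # tokens found in the encoder input: multiplier = 1 / penalty.
    multiplier = 1 / penalty
    return score * multiplier if score < 0 else score / multiplier

raw = -3.0  # made-up logit of a token that appears in the input
scores = [boost(raw, p) for p in (1.0, 2.0, 4.0)]
print(scores)  # [-3.0, -1.5, -0.75]: higher penalty, less negative logit
```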
Hey @ToasterLeavin – just as you wrote, a positive length penalty promotes longer sequences! You can find more information about that parameter (and others) in this doc page
Let me know if it is not clear
Thanks for the reply. Sorry for being unclear in my question, I was asking about encoder_repetition_penalty, not the length penalty.