What's the state of the art with spoken transcript editing?

jonathanjfshaw · February 26, 2023, 1:52pm

Whisper is awesome at producing written transcripts of speech. But speech itself is hard to read: it tends to be excessively verbose and to contain false starts or changes of direction mid-sentence.

What’s the state of the art with regard to editing spoken speech from transcripts into something easier to read?

So far I’ve found paraphrasing tools tend to be too willing to take risks with the meaning or to rewrite more than really necessary.

GPT3 will make confident changes that read nicely but are just wrong. And it’s quite hard to get GPT3’s edit endpoint to make any change at all.

So I’m looking for a relatively conservative technology that makes small changes where there’s a definite benefit to doing so.

joaogante · March 1, 2023, 4:52pm

Hey @jonathanjfshaw

It seems like you would want to use a sequence of models: first an ASR model like Whisper, then a summarization model. (You can do it in a single model, but you’d have to train it from scratch haha)

As you’ve written above, the summarization model may hallucinate. Luckily, you have full control over text generation with HF tools, and we have something that curbs these hallucinations: a repetition penalty/bonus for tokens that are present in the input.

The constant you define there will be multiplied by the token logits and, since the logits are negative values with 0.0 being the maximum “probability”, you want to set encoder_repetition_penalty between 0.0 and 1.0 to promote the use of tokens that are present at the input of the summarization task.

I hope this helps

jonathanjfshaw · March 3, 2023, 1:36pm

Thanks @joaogante that is incredibly helpful. You’re right, encoder_repetition_penalty is going to be exactly what I need to create the kind of faithful editing I have in mind. That really makes me much more optimistic this is achievable.

I’ve had a look at the other settings there, but it’s not clear that any of them can help with a separate aspect of the problem: how the model treats words it does not understand.

My text will likely contain important technical terms, and there will be too many of them for me to list in advance. I want the model to faithfully preserve words it does not understand, rather than omit them. Is there a setting that might influence whether it preserves or drops unfamiliar token sequences?
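No generation setting for this is named in the thread. As a fallback, a post-hoc check can at least flag words that disappeared during editing so they can be reviewed by hand. The following is only a sketch under stated assumptions: `dropped_terms` is a hypothetical helper (not part of any library), it uses naive regex word splitting rather than the model's tokenizer, and the example sentences are made up.

```python
import re


def dropped_terms(transcript, edited, ignore=frozenset()):
    """Return transcript words that are missing from the edited text.

    Hypothetical helper: words are lowercased and split with a simple
    regex, so this is a rough check, not a tokenizer-accurate one.
    """
    def words(text):
        return re.findall(r"[a-z0-9_'-]+", text.lower())

    kept = set(words(edited))
    seen = set()
    missing = []
    for word in words(transcript):
        # Report each missing word once, in order of first appearance.
        if word not in kept and word not in ignore and word not in seen:
            seen.add(word)
            missing.append(word)
    return missing


# Made-up example: the edit removed filler words but also a technical term.
transcript = "so um we we ran the the LoRA adapter through the quantizer"
edited = "We ran the adapter through the quantizer."
missing = dropped_terms(transcript, edited, ignore={"so", "um"})
print(missing)  # ['lora']
```

Anything the check reports still needs a human look, since some dropped words are exactly the false starts and filler that the edit was supposed to remove.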

ToasterLeavin · June 7, 2023, 3:24pm

Hey @joaogante in the link you provided to the raw code I see: self.penalty = 1 / penalty. Wouldn’t this mean that a higher penalty would promote the use of tokens that are present at the input?
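To make the arithmetic concrete, here is a minimal pure-Python sketch of what that line implies. It assumes the processor stores 1 / penalty, multiplies negative scores of input tokens by it, and divides non-negative ones by it. The function name `encoder_repetition_rescale` and the toy scores are made up for illustration, and this is not the library's actual code.

```python
def encoder_repetition_rescale(scores, encoder_token_ids, penalty):
    """Rescale the scores of tokens that appear in the encoder input.

    Assumed logic: keep inverse = 1 / penalty, multiply negative scores
    by it and divide non-negative scores by it. Other tokens are untouched.
    """
    if penalty <= 0:
        raise ValueError("penalty must be strictly positive")
    inverse = 1.0 / penalty
    rescaled = list(scores)
    for token_id in set(encoder_token_ids):
        score = rescaled[token_id]
        rescaled[token_id] = score * inverse if score < 0 else score / inverse
    return rescaled


# Toy vocabulary of 4 tokens; tokens 0 and 2 appear in the input text.
scores = [-4.0, -4.0, 2.0, 2.0]
boosted = encoder_repetition_rescale(scores, [0, 2], penalty=2.0)
damped = encoder_repetition_rescale(scores, [0, 2], penalty=0.5)
print(boosted)  # [-2.0, -4.0, 4.0, 2.0]
print(damped)   # [-8.0, -4.0, 1.0, 2.0]
```

Under that reading, a penalty above 1.0 raises the scores of tokens present in the input, and a penalty between 0.0 and 1.0 lowers them.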

I’ve also run some tests and it seems that a higher penalty gives outputs that are more grounded in the input text. It’s a small sample size so this observation could mean nothing.

It’s very hard to find much about this parameter online so any thoughts & feedback are appreciated.

Thanks

joaogante · June 8, 2023, 10:09am

Hey @ToasterLeavin – just as you wrote, a positive length penalty promotes longer sequences! You can find more information about that parameter (and others) in this doc page

Let me know if it is not clear

ToasterLeavin · June 8, 2023, 7:30pm

Thanks for the reply. Sorry for being unclear in my question, I was asking about encoder_repetition_penalty, not length_penalty.
