Korean finetuning on Whisper

Jin0 · October 16, 2023, 1:03pm

Hello.

I am fine-tuning Whisper on Korean, following the code from the blog post
"Fine-Tune Whisper For Multilingual ASR with Transformers".

from transformers import WhisperFeatureExtractor, WhisperForConditionalGeneration, WhisperProcessor, WhisperTokenizer

# DataCollatorSpeechSeq2SeqWithPadding is the collator class defined in the blog post
processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Korean", task="transcribe")
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
tokenizer = WhisperTokenizer.from_pretrained("openai/whisper-small", language="Korean", task="transcribe")
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

The pre-trained model already works well on Korean.

When I load the pre-trained model this way and fine-tune it only briefly on a small dataset, it shows amazing performance.

But there are two things I would like to change.

1. "." and "?" are added appropriately depending on the speech, but I would like to remove them.
=> Should I post-process the string inferred by the model?
I initially tried to delete them from the tokenizer’s vocabulary, but that failed.
2. The model seems to know which numbers can be written out in Korean, yet it writes them as digits rather than as Korean words.
=> The result I want: 사백 만
=> What the model inferred: 4백 만

Any ideas on this would be appreciated.

itaipee · February 25, 2024, 3:41pm

I have never tried Korean, but in general, if the fine-tuning data contains enough examples of numbers written as words instead of digits, then the fine-tuned model will follow those examples.
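
One way to get such examples is to normalize the reference transcripts before training. The block below is only a minimal sketch of that idea, not something taken from the blog post: `number_to_korean` and `normalize_transcript` are hypothetical helper names. It assumes Sino-Korean readings (일, 이, 삼, ...), and it does not handle decimals, phone numbers, or native Korean counting (하나, 둘, ...). The same function strips "." and "?", so it could also be run on the model output as plain post-processing.

```python
import re

DIGITS = ["", "일", "이", "삼", "사", "오", "육", "칠", "팔", "구"]
SMALL_UNITS = ["", "십", "백", "천"]   # 10^1, 10^2, 10^3 within a 4-digit group
LARGE_UNITS = ["", "만", "억", "조"]   # 10^4, 10^8, 10^12


def _group_to_korean(n: int) -> str:
    """Spell out 1..9999 in Sino-Korean."""
    parts = []
    for power in (3, 2, 1, 0):
        d = (n // 10 ** power) % 10
        if d == 0:
            continue
        if d == 1 and power > 0:
            # A leading 1 is left implicit before 십/백/천 (백, not 일백).
            parts.append(SMALL_UNITS[power])
        else:
            parts.append(DIGITS[d] + SMALL_UNITS[power])
    return "".join(parts)


def number_to_korean(n: int) -> str:
    """Spell out a non-negative integer below 10^16 in Sino-Korean."""
    if n < 0 or n >= 10 ** 16:
        raise ValueError("only 0 <= n < 10**16 is supported in this sketch")
    if n == 0:
        return "영"
    groups = []
    idx = 0
    while n > 0:
        n, group = divmod(n, 10000)
        if group:
            word = _group_to_korean(group)
            if group == 1 and idx == 1:
                word = ""  # assumption: write 10000 as 만 rather than 일만
            groups.append(word + LARGE_UNITS[idx])
        idx += 1
    return "".join(reversed(groups))


def normalize_transcript(text: str) -> str:
    """Remove '.' and '?' and replace digit runs with Korean number words."""
    text = re.sub(r"[.?]", "", text)

    def spell(match: re.Match) -> str:
        digits = match.group(0)
        if len(digits) > 16:
            return digits  # too long for this sketch, leave unchanged
        return number_to_korean(int(digits))

    text = re.sub(r"[0-9]+", spell, text)
    # Collapse any whitespace left behind by removed punctuation.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_transcript("4백 만."))
print(normalize_transcript("2023년에 시작했나요?"))
```

If you apply it to the training data, it has to run on each transcript before the text is tokenized into labels, so that the model only ever sees targets without punctuation and with numbers spelled out.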
