How to update the vocabulary of the Whisper processor

Unspoiled-Egg · March 28, 2024, 2:41pm

I want to fine-tune my Whisper model on a dataset whose transcriptions contain words that are not in the processor's vocabulary. How can I update the processor's vocabulary based on the transcripts in the training dataset?

I have tried using the code below:

    # `dataset` is assumed to be an iterable of transcript strings
    all_tokens = []
    for sent in dataset:
        # split the current sentence, not the whole dataset
        all_tokens.extend(sent.split())

    old_vocab = processor.tokenizer.get_vocab()
    new_tokens = list(set(all_tokens) - set(old_vocab.keys()))

    processor.tokenizer.add_tokens(new_tokens)

But after doing this, the model produces no transcriptions at all when I use it for inference.
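
For reference, here is a minimal sketch of how the snippet above is usually completed, not a confirmed fix for this case. It assumes `dataset` is an iterable of transcript strings and that `model` is a Transformers Whisper model loaded alongside `processor`; the helper names `find_new_tokens` and `extend_whisper_vocab` are made up for illustration. The one step the original snippet leaves out is resizing the model's token embeddings after calling `add_tokens`, so that the new token ids have rows in the embedding matrix.

```python
def find_new_tokens(sentences, vocab):
    """Return the whitespace-separated words in `sentences` that are not keys of `vocab`."""
    seen = set()
    for sent in sentences:
        seen.update(sent.split())
    # Sorted so the ids assigned to the new tokens are reproducible across runs.
    return sorted(seen - set(vocab))


def extend_whisper_vocab(model, processor, sentences):
    """Hypothetical helper: add unseen words to the tokenizer, then resize the model."""
    new_tokens = find_new_tokens(sentences, processor.tokenizer.get_vocab())
    num_added = processor.tokenizer.add_tokens(new_tokens)
    # Grow the embedding matrix so it covers the newly added token ids.
    model.resize_token_embeddings(len(processor.tokenizer))
    return num_added
```

It would be called as `extend_whisper_vocab(model, processor, dataset)` before training starts. Two caveats: the embeddings of the added tokens start out untrained, so the model is unlikely to produce sensible output for them until it has been fine-tuned; and since Whisper's tokenizer is a byte-level BPE, it can normally already encode arbitrary text as subword pieces, so adding whole words may not be necessary at all.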

Unspoiled-Egg · March 28, 2024, 2:42pm

@sanchit-gandhi can you help me with this?
