I want to fine-tune my Whisper model on a dataset whose transcriptions contain tokens that are not present in the processor's vocabulary. How can I update the processor's vocabulary based on the transcripts in the training dataset?
I have tried using the code below:
# Collect every whitespace-separated token from the transcripts
all_tokens = []
for sent in dataset:
    all_tokens.extend(sent.split())
# Keep only the tokens the tokenizer does not already know
old_vocab = processor.tokenizer.get_vocab()
new_tokens = list(set(all_tokens) - set(old_vocab.keys()))
processor.tokenizer.add_tokens(new_tokens)
However, after doing this, the model produces no transcriptions at all when I use it for inference.
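One step that appears to be missing from the snippet above is resizing the model's embedding matrix after the tokenizer grows. The following is a minimal sketch of that idea, not a confirmed fix for the empty output. It assumes `transcripts` is an iterable of plain strings and that `tokenizer` and `model` are Hugging Face objects exposing `get_vocab`, `add_tokens`, `__len__`, and `resize_token_embeddings`. The helper names `collect_new_tokens` and `extend_vocab` are hypothetical, introduced only for illustration.

```python
def collect_new_tokens(transcripts, vocab):
    """Return the whitespace-separated tokens that are missing from vocab, sorted."""
    seen = set()
    for sent in transcripts:
        seen.update(sent.split())
    return sorted(seen - set(vocab))


def extend_vocab(tokenizer, model, transcripts):
    """Add unseen tokens to the tokenizer and resize the model to match.

    Returns the number of tokens the tokenizer reports as added.
    """
    new_tokens = collect_new_tokens(transcripts, tokenizer.get_vocab())
    num_added = tokenizer.add_tokens(new_tokens)
    if num_added > 0:
        # Assumption: without this call the model has no embedding rows
        # for the newly added token ids.
        model.resize_token_embeddings(len(tokenizer))
    return num_added
```

It would be called as `extend_vocab(processor.tokenizer, model, transcripts)` before training. The new embedding rows are freshly initialized, so the model still has to be fine-tuned before those tokens mean anything. Whisper's tokenizer is also byte-level BPE, so it may already be able to encode the new words as sub-word pieces without any vocabulary change.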