I am trying to fine-tune the Whisper model on the TIMIT dataset, and I have run into a problem with the tokenization process. TIMIT's transcriptions consist of phonemes, not words, while Whisper follows GPT-2 in using a byte-level BPE tokenizer, which is designed for word and subword prediction. So here are my questions:
Any ideas on whether byte-level BPE would still be suitable for phoneme prediction? My assumption is that a straightforward encoding, i.e. one token per phoneme without a merges.txt, would be a better fit.
How do I initialize a vocabulary for the Whisper / GPT-2 tokenizer in at least one of those two cases (with or without merges.txt)? I tried initializing a vocab containing just the unique phonemes, but it throws an exception from a hard-coded check on the special tokens.
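To make the second question concrete, here is a minimal sketch of the merge-free, phoneme-level vocabulary I have in mind (pure Python, no tokenizer library involved). The special-token list below is only an illustrative subset, not Whisper's full hard-coded set, and the phoneme list is a small slice of TIMIT's 61-phone inventory:

```python
# Illustrative subset of Whisper-style special tokens; the real tokenizer
# hard-codes a much larger set (timestamps, all language tokens, etc.).
SPECIAL_TOKENS = [
    "<|endoftext|>",
    "<|startoftranscript|>",
    "<|en|>",
    "<|transcribe|>",
    "<|notimestamps|>",
]

# Small placeholder slice of TIMIT's phone inventory.
PHONEMES = ["sil", "aa", "ae", "ah", "b", "d", "iy", "k", "s", "t"]

def build_vocab(special_tokens, phonemes):
    """Assign consecutive IDs: special tokens first, then phonemes."""
    vocab = {}
    for tok in special_tokens + phonemes:
        vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab, unk_id=None):
    """Whitespace-split phoneme string -> IDs; no BPE merges involved."""
    return [vocab.get(p, unk_id) for p in text.split()]

vocab = build_vocab(SPECIAL_TOKENS, PHONEMES)
ids = encode("sil b iy t sil", vocab)  # every phoneme is one token
```

Something like this `vocab` dict could then be dumped to a vocab.json, which is where I hit the special-token exception when the hard-coded tokens are missing.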