I am trying to fine-tune the Whisper model on the TIMIT dataset, and I have run into a problem with the tokenization process. TIMIT's transcriptions consist of phonemes, not words, while Whisper follows GPT-2 in using a byte-level BPE tokenizer, which is designed for word and subword prediction. So here are my questions:
Any ideas on whether byte-level BPE would still be suitable for phoneme prediction? My assumption is that a straightforward encoding, i.e. one token per phoneme without a merges.txt, would be a better fit.
How do I initialize a vocabulary for the Whisper / GPT-2 tokenizer in at least one of those two cases (with or without merges.txt)? I tried initializing a vocab containing just the unique phonemes, but it throws an exception from a hard-coded check on the special tokens.
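To make the second question concrete, here is a minimal sketch of the merge-free, phoneme-level vocabulary I have in mind (pure Python, no tokenizer library involved). The special-token list below is only an illustrative subset, not Whisper's full hard-coded set, and the phoneme list is a small slice of TIMIT's 61-phone inventory:

```python
# Illustrative subset of Whisper-style special tokens; the real tokenizer
# hard-codes a much larger set (timestamps, all language tokens, etc.).
SPECIAL_TOKENS = [
    "<|endoftext|>",
    "<|startoftranscript|>",
    "<|en|>",
    "<|transcribe|>",
    "<|notimestamps|>",
]

# Small placeholder slice of TIMIT's phone inventory.
PHONEMES = ["sil", "aa", "ae", "ah", "b", "d", "iy", "k", "s", "t"]

def build_vocab(special_tokens, phonemes):
    """Assign consecutive IDs: special tokens first, then phonemes."""
    vocab = {}
    for tok in special_tokens + phonemes:
        vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab, unk_id=None):
    """Whitespace-split phoneme string -> IDs; no BPE merges involved."""
    return [vocab.get(p, unk_id) for p in text.split()]

vocab = build_vocab(SPECIAL_TOKENS, PHONEMES)
ids = encode("sil b iy t sil", vocab)  # every phoneme is one token
```

Something like this `vocab` dict could then be dumped to a vocab.json, which is where I hit the special-token exception when the hard-coded tokens are missing.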