However, if I need it to support a new language (one the tokenizer does not support), how could I do that? Could you please point me to documentation or an example I could follow?
Yes, I’m interested in this too, particularly for very low-resource languages like Wolof. Do you need to train a BPE tokenizer on Wolof transcriptions, then pass the resulting vocab.json and merges file to the WhisperTokenizer?
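To make the question concrete, here is a rough sketch of what I mean by "training a BPE tokenizer" on transcriptions. It is a toy, character-level version in plain Python. The names `train_bpe` and `build_vocab` and the tiny corpus are made up for illustration; it is not Whisper's actual tokenizer and does not handle Whisper's special tokens. Whether a vocabulary and merge list learned this way can simply be handed to WhisperTokenizer is exactly the part I'm unsure about.

```python
from collections import Counter


def train_bpe(texts, num_merges):
    """Learn up to `num_merges` BPE merge rules from an iterable of strings."""
    # Start with every whitespace-separated word split into single characters.
    words = Counter()
    for text in texts:
        for word in text.split():
            words[tuple(word)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol

        # Pick the most frequent pair; ties are broken deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)

        # Rewrite every word with the chosen pair fused into one symbol.
        new_words = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges


def build_vocab(texts, merges):
    """Map each base character and each merged symbol to an integer id."""
    symbols = sorted({ch for text in texts for ch in text if not ch.isspace()})
    symbols += [a + b for a, b in merges]
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return {s: i for i, s in enumerate(dict.fromkeys(symbols))}


# Hypothetical toy corpus standing in for real transcriptions.
corpus = ["aa aa aa ab"]
merges = train_bpe(corpus, num_merges=5)
vocab = build_vocab(corpus, merges)
print(merges)  # [('a', 'a'), ('a', 'b')]
print(vocab)   # {'a': 0, 'b': 1, 'aa': 2, 'ab': 3}
```

The merge list plays the role of the merges file and the dictionary plays the role of vocab.json. In a real attempt I would assume a byte-level trainer and keep Whisper's special tokens, which this sketch ignores.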
Did you by any chance figure out a solution? I’m in the same situation now with a different language, and I wonder if you have any advice for me.
Thanks in advance!