Hi there,
My aim is to fine-tune an existing pretrained LLM on a new language. My new language contains vowels with the following accents: ['Ā', 'ā', 'Ē', 'ē', 'Ī', 'ī', 'Ō', 'ō', 'Ū', 'ū'].
I first train a new tokenizer on my target language. This tokenizer performs well on my target language: common words that include accented vowels are encoded as single new tokens instead of being split apart.
new_tokenizer = tokenizer.train_new_from_iterator(training_corpus, new_vocab_size)
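For context, the setup for this step looks roughly like the sketch below. The base checkpoint, corpus path, batch size and vocab size are just placeholders, not my exact setup; the only real requirement is a fast (Rust-backed) tokenizer, since `train_new_from_iterator` is only available on fast tokenizers.

```python
from transformers import AutoTokenizer

# Placeholder base checkpoint - any model with a fast tokenizer works for this sketch.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def get_training_corpus():
    # Yield the target-language corpus as batches of raw text lines.
    with open("target_language_corpus.txt", encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line)
            if len(batch) == 1000:
                yield batch
                batch = []
        if batch:
            yield batch

training_corpus = get_training_corpus()
new_vocab_size = 32_000  # placeholder
new_tokenizer = tokenizer.train_new_from_iterator(training_corpus, new_vocab_size)
```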
I then add the vocab from the new tokenizer to the pretrained tokenizer:
new_vocab = list(new_tokenizer.vocab.keys())
tokenizer.add_tokens(new_vocab)
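A sanity check along these lines (same names as above; the printed values are only illustrative) at least shows how many tokens actually got registered:

```python
# add_tokens() (above) returns how many of the passed tokens were genuinely new.
# After adding, the tokenizer keeps them in a separate "added vocab":
print(len(tokenizer))                                 # total vocab size after adding
print(list(tokenizer.get_added_vocab().items())[:5])  # a few of the added tokens and their ids
```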
My issue is that when I use the tokenizer with the added tokens, it ignores them and still splits accented vowels into separate tokens.
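For example, a minimal way to reproduce what I'm seeing (the sample text and the exact pieces in the output are only illustrative and depend on the base tokenizer):

```python
text = "a sentence from the target language with ā, ē, ī, ō and ū in it"
print(tokenizer.tokenize(text))
# Instead of using the added tokens, words containing accented vowels
# still come back with each accented vowel split off into its own piece(s).
```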
Does anyone know how to solve this issue?