Tokenizer splits words with accents into separate subwords

Hi there,

My aim is to finetune an existing pretrained LLM on a new language. My new language contains vowels with the following accents: ['Ā', 'ā', 'Ē', 'ē', 'Ī', 'ī', 'Ō', 'ō', 'Ū', 'ū'].

I first train a new tokenizer on my target language. This tokenizer performs well on the target language: common words that include accented vowels each get their own token in the new vocabulary.

new_tokenizer = tokenizer.train_new_from_iterator(training_corpus, new_vocab_size)
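For context, here is roughly how that step fits together. This is only a minimal sketch: the base model name ("gpt2"), the corpus file ("corpus.txt") and the vocab size are placeholders for my actual setup.

from transformers import AutoTokenizer

# Fast tokenizer of the pretrained base model (placeholder model name)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Stream the target-language corpus in batches of lines (placeholder path)
def get_training_corpus(path="corpus.txt", batch_size=1000):
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

training_corpus = get_training_corpus()
new_vocab_size = 32000  # placeholder
new_tokenizer = tokenizer.train_new_from_iterator(training_corpus, new_vocab_size)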

I then add the vocab from the new tokenizer to the pretrained tokenizer:

new_vocab = list(new_tokenizer.vocab.keys())
tokenizer.add_tokens(new_vocab)
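add_tokens returns how many entries it actually accepted (tokens already present in the vocabulary are skipped), so a quick sanity check looks like this (a sketch):

num_added = tokenizer.add_tokens(new_vocab)
print(f"{num_added} of {len(new_vocab)} tokens were added to the pretrained tokenizer")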

My issue is that when I use the tokenizer with the added tokens, it ignores them and always splits accented vowels into separate tokens.
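To show the behaviour concretely, here is the kind of check I run (a sketch only; tokens pulled straight from the new vocab may carry subword prefix markers, so this is just illustrative):

# Pick an added token that contains one of the accented vowels listed above
accented = [t for t in new_vocab if any(v in t for v in "ĀāĒēĪīŌōŪū")]
example = accented[0]
print(example, "->", tokenizer.tokenize(example))
# Even though this token was added via add_tokens, the accented vowel still
# comes back broken out into its own piece(s) instead of the single added token.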

Does anyone know how to solve this issue?