Tokenizer splits words with accents into separate subwords

Hi there,

My aim is to finetune an existing pretrained LLM on a new language. My new language contains vowels with the following accents: ['Ā', 'ā', 'Ē', 'ē', 'Ī', 'ī', 'Ō', 'ō', 'Ū', 'ū'].

I first train a new tokenizer on my target language. This tokenizer performs well on the target language: common words that include accented vowels each get their own token in the new vocabulary.

new_tokenizer = tokenizer.train_new_from_iterator(training_corpus, new_vocab_size)
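For context, here is roughly how that step fits together. This is only a minimal sketch: the base model name ("gpt2"), the corpus file ("corpus.txt") and the vocab size are placeholders for my actual setup.

from transformers import AutoTokenizer

# Fast tokenizer of the pretrained base model (placeholder model name)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Stream the target-language corpus in batches of lines (placeholder path)
def get_training_corpus(path="corpus.txt", batch_size=1000):
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

training_corpus = get_training_corpus()
new_vocab_size = 32000  # placeholder
new_tokenizer = tokenizer.train_new_from_iterator(training_corpus, new_vocab_size)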

I then add the vocab from the new tokenizer to the pretrained tokenizer:

new_vocab = list(new_tokenizer.vocab.keys())
tokenizer.add_tokens(new_vocab)
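add_tokens returns how many entries it actually accepted (tokens already present in the vocabulary are skipped), so a quick sanity check looks like this (a sketch):

num_added = tokenizer.add_tokens(new_vocab)
print(f"{num_added} of {len(new_vocab)} tokens were added to the pretrained tokenizer")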

My issue is that when I use the tokenizer with the added tokens, it ignores them and always splits accented vowels into separate tokens.
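To show the behaviour concretely, here is the kind of check I run (a sketch only; tokens pulled straight from the new vocab may carry subword prefix markers, so this is just illustrative):

# Pick an added token that contains one of the accented vowels listed above
accented = [t for t in new_vocab if any(v in t for v in "ĀāĒēĪīŌōŪū")]
example = accented[0]
print(example, "->", tokenizer.tokenize(example))
# Even though this token was added via add_tokens, the accented vowel still
# comes back broken out into its own piece(s) instead of the single added token.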

Does anyone know how to solve this issue?