I want to ask a question about `LlamaTokenizerFast`, or fast tokenizers in general.
I am working on a project to build an LLM that supports Arabic. We decided to extend the LlamaTokenizer vocabulary with Arabic tokens, following the Chinese-LLaMA approach.
We add the tokens and save the resulting tokenizer. But when we convert it to a fast tokenizer simply by loading it with `AutoTokenizer`, the fast version produces on average 8 extra tokens per Arabic text compared to the slow one.
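For reference, this is roughly how I am measuring the difference. It is a minimal sketch: the local path `./extended-llama-tokenizer` and the sample sentence are placeholders for our actual extended tokenizer and evaluation data.

```python
def avg_token_count(tokenize, texts):
    # `tokenize` is any callable returning a dict with an "input_ids" list,
    # e.g. a Hugging Face tokenizer instance called on a string.
    return sum(len(tokenize(t)["input_ids"]) for t in texts) / len(texts)

if __name__ == "__main__":
    from transformers import AutoTokenizer

    # Placeholder path: directory of the LlamaTokenizer after add_tokens() + save_pretrained()
    slow = AutoTokenizer.from_pretrained("./extended-llama-tokenizer", use_fast=False)
    fast = AutoTokenizer.from_pretrained("./extended-llama-tokenizer", use_fast=True)

    texts = ["مرحبا بالعالم"]  # sample Arabic sentences
    print("slow avg tokens:", avg_token_count(slow, texts))
    print("fast avg tokens:", avg_token_count(fast, texts))
```

On our Arabic corpus, the `fast` average comes out about 8 tokens higher than the `slow` one.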
Can anyone please explain what is going on?