I want to ask a question about `LlamaTokenizerFast`, or fast tokenizers in general.
I am working on a project to build an LLM that supports Arabic. We decided to extend the LlamaTokenizer vocabulary with Arabic tokens, following the Chinese-LLaMA approach.
We add the tokens and save the resulting tokenizer. But when we convert it to a fast tokenizer simply by loading it with `AutoTokenizer`, the fast version produces on average 8 extra tokens per Arabic text compared to the slow one.
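For reference, this is roughly how I am measuring the difference. It is a minimal sketch: the local path `./extended-llama-tokenizer` and the sample sentence are placeholders for our actual extended tokenizer and evaluation data.

```python
def avg_token_count(tokenize, texts):
    # `tokenize` is any callable returning a dict with an "input_ids" list,
    # e.g. a Hugging Face tokenizer instance called on a string.
    return sum(len(tokenize(t)["input_ids"]) for t in texts) / len(texts)

if __name__ == "__main__":
    from transformers import AutoTokenizer

    # Placeholder path: directory of the LlamaTokenizer after add_tokens() + save_pretrained()
    slow = AutoTokenizer.from_pretrained("./extended-llama-tokenizer", use_fast=False)
    fast = AutoTokenizer.from_pretrained("./extended-llama-tokenizer", use_fast=True)

    texts = ["مرحبا بالعالم"]  # sample Arabic sentences
    print("slow avg tokens:", avg_token_count(slow, texts))
    print("fast avg tokens:", avg_token_count(fast, texts))
```

On our Arabic corpus, the `fast` average comes out about 8 tokens higher than the `slow` one.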
Can anyone please explain what is going on?