I trained a tokenizer on English Wikipedia texts. Everything works fine; however, looking into the vocabulary of the saved tokenizer, I find a multitude of Unicode characters that I would not expect to occur in the corpus, e.g., many Chinese characters.
You can take a look at the vocabulary here: finroberta/dicts_and_tokenizers/wikipedia_tokenizer.json at main · RalfKellner/finroberta · GitHub
The odd symbols start at id 68 and continue up to approximately id 10,000 (out of 40,000 tokens).
I use Unicode normalization when training the tokenizer, which I thought would prevent this behavior. More concretely, I use NFD normalization, but I have also tried the other normalization forms.
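For reference, this is roughly how I attach the normalizer; a minimal sketch with the `tokenizers` library, not my exact script (the tiny in-memory corpus here just stands in for the Wikipedia dump):

```python
# Sketch: BPE tokenizer with an NFD normalizer, trained on a toy corpus.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import NFD
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.normalizer = NFD()           # Unicode NFD normalization
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=40000, special_tokens=["[UNK]"])
corpus = [
    "This is a tiny example corpus.",
    "It stands in for the full Wikipedia dump.",
]
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Note: NFD only decomposes characters (e.g. é -> e + combining accent);
# it does not filter non-Latin code points, so any character present in
# the training data can still end up in the vocabulary.
print(tokenizer.get_vocab_size())
```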
This also happens when using a domain-specific corpus from the financial markets area, so I wonder whether I am doing something wrong here.
The training script can be found here: finroberta/00_train_wikipedia_tokenizer.py at main · RalfKellner/finroberta · GitHub
I would appreciate any help or clarification! Many thanks in advance!