Many ambiguous unicode characters for trained tokenizer


I trained a tokenizer on english wikipedia texts. Everything works fine, however, looking into the vocabulary of the saved tokenizer, I find a multitude of unicode characters which I would not expect to be included in the corpus, e.g., many chinese symbols.

You can take a look at the vocabulary here: finroberta/dicts_and_tokenizers/wikipedia_tokenizer.json at main 路 RalfKellner/finroberta 路 GitHub

Odd symbols start at id 68 and go until approximately 10,000 (out of 40,000 tokens).

I use unicode normalization when training the tokenizer which I thought would prevent this behavior. More concrete, I use NFD normalization, however, I also tried other forms of normalization.

This also happens when using a domain specific corpus from the financial markets area, so I was wondering if I am doing something wrong here?

The training script can be found here: finroberta/ at main 路 RalfKellner/finroberta 路 GitHub

I would appreciate any help or clarification! Many thanks in advance!