I trained a tokenizer on English Wikipedia texts. Everything works fine; however, looking into the vocabulary of the saved tokenizer, I find a multitude of Unicode characters that I would not expect to occur in the corpus, e.g., many Chinese characters.
You can take a look at the vocabulary here: finroberta/dicts_and_tokenizers/wikipedia_tokenizer.json at main · RalfKellner/finroberta · GitHub
The odd symbols start at id 68 and continue up to approximately id 10,000 (out of 40,000 tokens).
I use Unicode normalization when training the tokenizer, which I thought would prevent this behavior. More concretely, I use NFD normalization, but I have also tried the other normalization forms.
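For reference, this is roughly how I attach the normalizer; a minimal sketch with the `tokenizers` library, not my exact script (the tiny in-memory corpus here just stands in for the Wikipedia dump):

```python
# Sketch: BPE tokenizer with an NFD normalizer, trained on a toy corpus.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import NFD
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.normalizer = NFD()           # Unicode NFD normalization
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=40000, special_tokens=["[UNK]"])
corpus = [
    "This is a tiny example corpus.",
    "It stands in for the full Wikipedia dump.",
]
tokenizer.train_from_iterator(corpus, trainer=trainer)

# Note: NFD only decomposes characters (e.g. é -> e + combining accent);
# it does not filter non-Latin code points, so any character present in
# the training data can still end up in the vocabulary.
print(tokenizer.get_vocab_size())
```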
This also happens when using a domain-specific corpus from the financial markets area, so I wonder whether I am doing something wrong here.
The training script can be found here: finroberta/00_train_wikipedia_tokenizer.py at main · RalfKellner/finroberta · GitHub
I would appreciate any help or clarification! Many thanks in advance!