Ġ token inserted by ByteLevelBPETokenizer

In the example notebook "How to train a new language model from scratch using Transformers and Tokenizers", I noticed that after the encoding step, some new characters appear in some of the tokens:

tokenizer.encode("Mi estas Julien.").tokens

Results in

['<s>', 'Mi', 'Ġestas', 'ĠJuli', 'en', '.', '</s>']

What is the significance of these G's with marks over them? I thought they could represent something like a word that is continuing, or that is broken apart by the tokenizer, but that doesn't appear to be true in this example.

I ask in part because when I try to train this on another language, I end up with these same G's, but they are standalone tokens:

tokenizer.encode("Waan ku salaamayaa.").tokens
['<s>', 'W', 'aan', 'Ġ', 'ku', 'Ġ', 'sal', 'aam', 'ay', 'aa', '.', '</s>']

Because I am not sure of their meaning, I can't tell whether this is a problem. In particular, I don't know whether it matters that they are separate tokens in my output, while in the sample code they appear only at the beginning of tokens.

Edit: From some further research, I can see that this character represents a space. But why is it attached to other tokens in the example, while it stands alone in my project?
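
For anyone else who runs into this, here is a minimal sketch of where the Ġ seems to come from. It rebuilds the byte-to-character table used by GPT-2 style byte-level BPE as I understand it, and I am assuming ByteLevelBPETokenizer uses the same table. The names `bytes_to_unicode` and `to_byte_level` are just my own for illustration, not the library's API. It only shows how a space becomes Ġ; it does not explain why the Ġ ends up as a separate token in my case.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(0x21, 0x7F))     # '!' .. '~'
        + list(range(0xA1, 0xAD))   # '¡' .. '¬'
        + list(range(0xAE, 0x100))  # '®' .. 'ÿ'
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    # Every other byte (control characters, space, ...) is shifted to 256 + offset.
    for b in range(256):
        if b not in printable:
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()


def to_byte_level(text):
    """Render text the way byte-level tokens are displayed."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


print(repr(BYTE_TO_CHAR[0x20]))  # 'Ġ' (U+0120): the space byte
print(to_byte_level(" estas"))   # Ġestas
```

So a token like 'Ġestas' would be " estas" with its leading space, and a lone 'Ġ' would be a space on its own.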
