In this example notebook, How to train a new language model from scratch using Transformers and Tokenizers, I notice that after the encoding step, some new characters are introduced into some of the tokens:
tokenizer.encode("Mi estas Julien.").tokens
['<s>', 'Mi', 'Ġestas', 'ĠJuli', 'en', '.', '</s>']
What is the significance of these G's with marks over them? I thought they might represent something like a word that continues, or one that was broken apart by the tokenizer, but neither appears to be true in this example.
I ask in part because when I try to train this on another language, I end up with the same G's, but as standalone tokens:
tokenizer.encode("Waan ku salaamayaa.").tokens
['<s>', 'W', 'aan', 'Ġ', 'ku', 'Ġ', 'sal', 'aam', 'ay', 'aa', '.', '</s>']
Because I am not sure of their meaning, I am not sure whether this is a problem, or whether it matters that they appear as separate tokens in my example, when in the sample code they only appear at the beginning of tokens.
Edit: From some research, I can see that this character represents a space. But why does it appear attached to other tokens in the example, while it stands alone in my project?
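If it helps anyone else, here is a minimal sketch of where the character comes from, assuming the tokenizer uses a GPT-2/RoBERTa-style byte-level BPE. That scheme remaps raw bytes to printable Unicode characters so every token can be displayed, and the space byte (0x20) ends up shifted by 256:

```python
# In GPT-2-style byte-level BPE, non-printable bytes are remapped by an
# offset so tokens are displayable. The space byte 0x20 lands on
# U+0120, LATIN CAPITAL LETTER G WITH DOT ABOVE.
space_marker = chr(0x20 + 256)
print(space_marker)         # Ġ
print(space_marker == "Ġ")  # True

# So 'Ġestas' is really ' estas': a token whose text begins with a
# space, i.e. a word that followed whitespace in the original input.
# A standalone 'Ġ' token is a space that was not merged into any
# neighboring word piece during BPE training.
```

If that reading is right, a standalone `Ġ` in my output would just mean the learned merges never absorbed the space into the following word, perhaps because the vocabulary or training corpus was too small.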