Ġ token inserted by ByteLevelBPETokenizer

innomadic · November 1, 2023, 8:41am

In the example notebook "How to train a new language model from scratch using Transformers and Tokenizers", I noticed that after the encoding step, some new characters are introduced into some of the tokens:

tokenizer.encode("Mi estas Julien.").tokens

This results in:

['<s>', 'Mi', 'Ġestas', 'ĠJuli', 'en', '.', '</s>']

What is the significance of these G’s with marks over them? I thought they could represent something like a word that is continuing, or that is broken apart by the tokenizer, but that doesn’t appear to be true in this example.

I ask in part because when I try to train this on another language, I end up with these same G’s, but they are standalone tokens:

tokenizer.encode("Waan ku salaamayaa.").tokens
['<s>', 'W', 'aan', 'Ġ', 'ku', 'Ġ', 'sal', 'aam', 'ay', 'aa', '.', '</s>']

Because I am not sure of their meaning, I am not sure whether this is a problem. In particular, I can't tell whether it matters that they are separate tokens in my example, whereas in the sample code they appear only at the beginning of tokens.

Edit: After some further research, I can see that this character represents a space. But why is it attached to other tokens in the example, while it stands alone in my project?
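
For anyone else puzzled by the Ġ, here is a minimal sketch of where it comes from. It assumes the tokenizer uses the GPT-2 style byte-to-character table, in which printable bytes keep their own character and all other bytes (including the space) are shifted to code points above 255. The function names `bytes_to_unicode` and `detokenize` are illustrative and not part of the tokenizers API, and the special tokens `<s>` and `</s>` are left out of the lists.

```python
def bytes_to_unicode():
    """Map each byte value 0-255 to a visible unicode character (GPT-2 style)."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # Every other byte (space, control characters, ...) is shifted past 255.
    offset = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + offset)
            offset += 1
    return mapping


byte_to_char = bytes_to_unicode()
char_to_byte = {c: b for b, c in byte_to_char.items()}

# The space is byte 32, the 33rd shifted byte, so it lands on U+0120.
space_symbol = byte_to_char[ord(" ")]


def detokenize(tokens):
    """Join tokens and map every character back to its original byte."""
    text = "".join(tokens)
    return bytes(char_to_byte[c] for c in text).decode("utf-8")


# Token lists copied from the two outputs above, without <s> and </s>.
attached = ["Mi", "Ġestas", "ĠJuli", "en", "."]
standalone = ["W", "aan", "Ġ", "ku", "Ġ", "sal", "aam", "ay", "aa", "."]

print(space_symbol)
print(detokenize(attached))
print(detokenize(standalone))
```

Both token lists map back to the original sentence, so a standalone Ġ still carries the space. The two outputs differ in how the text was segmented, not in what they encode.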
