Why does the Code Llama token for prefix, suffix, etc. start with a weird underscore character?

I noticed these lines of code in the transformers library:
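(Reproducing them here roughly as they appear in the CodeLlamaTokenizer defaults; note the leading character on the first four tokens:)

        prefix_token="▁<PRE>",
        middle_token="▁<MID>",
        suffix_token="▁<SUF>",
        eot_token="▁<EOT>",
        fill_token="<FILL_ME>"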

Notice the funky underscore characters at the start of those tokens. Why are they there? Shouldn’t the lines just be:

        prefix_token="<PRE>",
        middle_token="<MID>",
        suffix_token="<SUF>",
        eot_token="<EOT>",
        fill_token="<FILL_ME>"

@ArthurZ do you know why you put those in?

Hey! That’s because otherwise SentencePiece does not recognize them and will split them! The SentencePiece model was trained with the tokens in that exact form, so that is what it expects.
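You can see the difference with something like this (the checkpoint name is just an example; any Code Llama checkpoint should behave the same way):

    from transformers import CodeLlamaTokenizer

    # Example checkpoint, used here only for illustration.
    tok = CodeLlamaTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

    # "▁<PRE>" (with the leading U+2581) is a single piece in the SentencePiece
    # vocabulary, so it maps straight to one id:
    print(tok.convert_tokens_to_ids("▁<PRE>"))

    # A plain "<PRE>" is not a piece the model knows, so it just gets split
    # into ordinary sub-word pieces:
    print(tok.tokenize("<PRE>"))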


BTW, thanks for replying so quickly!

OK, I would not have learned that otherwise. Is this a well-known quirk that applies across the whole transformers API? Just curious.

Also, was it intentional to use that weird underscore character?
▁ ← the character used in your code
_ ← a normal underscore

It is subtle, but they are different. Does this really matter?
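For anyone curious, this is how I checked that the two characters really are distinct code points:

    print(hex(ord("▁")))  # 0x2581 -- LOWER ONE EIGHTH BLOCK, SentencePiece's meta symbol
    print(hex(ord("_")))  # 0x5f   -- LOW LINE, the ordinary ASCII underscore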

OK, in case anyone else is interested, here is what I just found in the SentencePiece docs:

SentencePiece treats the input text just as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle the whitespace as a basic token explicitly, SentencePiece first escapes the whitespace with a meta symbol “▁” (U+2581) as follows.
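In practice it looks something like this with the sentencepiece Python package (the model file name below is just a placeholder for any trained model):

    import sentencepiece as spm

    # "m.model" stands in for whatever trained SentencePiece model file you have.
    sp = spm.SentencePieceProcessor(model_file="m.model")

    # Every space in the input is escaped to the "▁" (U+2581) meta symbol before
    # segmentation, so the pieces come back looking like:
    print(sp.encode("Hello World.", out_type=str))
    # e.g. ['▁Hello', '▁World', '.']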