Why does the Code Llama token for prefix, suffix, etc. start with a weird underscore character?

I noticed these lines of code in the transformers library:
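(Reproducing them here roughly as they appear in the CodeLlamaTokenizer defaults; note the leading character on the first four tokens:)

        prefix_token="▁<PRE>",
        middle_token="▁<MID>",
        suffix_token="▁<SUF>",
        eot_token="▁<EOT>",
        fill_token="<FILL_ME>"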

Notice the funky underscore characters at the start of those tokens. Why are they there? Shouldn’t the lines just be:

        prefix_token="<PRE>",
        middle_token="<MID>",
        suffix_token="<SUF>",
        eot_token="<EOT>",
        fill_token="<FILL_ME>"

@ArthurZ do you know why you put those in?

Hey! That’s because otherwise SentencePiece does not recognize them and will split them! The SentencePiece model was trained with the tokens in that exact form, so that is what it expects.
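You can see the difference with something like this (the checkpoint name is just an example; any Code Llama checkpoint should behave the same way):

    from transformers import CodeLlamaTokenizer

    # Example checkpoint, used here only for illustration.
    tok = CodeLlamaTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

    # "▁<PRE>" (with the leading U+2581) is a single piece in the SentencePiece
    # vocabulary, so it maps straight to one id:
    print(tok.convert_tokens_to_ids("▁<PRE>"))

    # A plain "<PRE>" is not a piece the model knows, so it just gets split
    # into ordinary sub-word pieces:
    print(tok.tokenize("<PRE>"))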


BTW, thanks for replying so quickly!

OK, I would not have learned that otherwise. Is this a well-known quirk that applies across the whole transformers API? Just curious.

Also, was it intentional to use that weird underscore character?
▁ ← the character used in your code
_ ← a normal underscore

It is subtle, but they are different. Does this really matter?
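For anyone curious, this is how I checked that the two characters really are distinct code points:

    print(hex(ord("▁")))  # 0x2581 -- LOWER ONE EIGHTH BLOCK, SentencePiece's meta symbol
    print(hex(ord("_")))  # 0x5f   -- LOW LINE, the ordinary ASCII underscore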

OK, in case anyone else is interested, here is what I just found in the SentencePiece docs:

SentencePiece treats the input text just as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle the whitespace as a basic token explicitly, SentencePiece first escapes the whitespace with a meta symbol “▁” (U+2581) as follows.
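In practice it looks something like this with the sentencepiece Python package (the model file name below is just a placeholder for any trained model):

    import sentencepiece as spm

    # "m.model" stands in for whatever trained SentencePiece model file you have.
    sp = spm.SentencePieceProcessor(model_file="m.model")

    # Every space in the input is escaped to the "▁" (U+2581) meta symbol before
    # segmentation, so the pieces come back looking like:
    print(sp.encode("Hello World.", out_type=str))
    # e.g. ['▁Hello', '▁World', '.']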