OK, I would not have learned that without asking you. Is this a known thing across the industry that happens with all of the transformers API? Just curious.
Also, was it intentional to use that weird underscore character? In your code you are using ▁, whereas a normal underscore is _
It is subtle, but they are different. Does this really matter?
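A quick way to check that the two really are different characters, as a minimal sketch using only Python's standard library (the variable names are just for illustration):

```python
import unicodedata

meta = "\u2581"   # the underscore-like character from the code above
underscore = "_"  # the ordinary ASCII underscore, U+005F

# Print the code point and official Unicode name of each character.
for ch in (meta, underscore):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# String comparison treats them as unrelated characters.
print(meta == underscore)
```

So anything that matches on one of them (a vocabulary lookup, a string replace) will not match the other.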
OK, if anyone else is interested, this is from the SentencePiece docs, which I just discovered:
SentencePiece treats the input text just as a sequence of Unicode characters. Whitespace is also handled as a normal symbol. To handle the whitespace as a basic token explicitly, SentencePiece first escapes the whitespace with a meta symbol "▁" (U+2581) as follows.
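To make the idea concrete, here is my own rough sketch of that escaping step in plain Python. It is not taken from the docs and does not call the SentencePiece library; `escape_whitespace` and `unescape_whitespace` are hypothetical helper names, and the real tokenizer may apply further normalization that this ignores.

```python
META = "\u2581"  # U+2581, the meta symbol named in the docs

def escape_whitespace(text: str) -> str:
    # Hypothetical helper: swap each space for the meta symbol so that
    # whitespace becomes an ordinary, visible character in the sequence.
    return text.replace(" ", META)

def unescape_whitespace(text: str) -> str:
    # Reverse step: turn the meta symbol back into a plain space.
    return text.replace(META, " ")

escaped = escape_whitespace("Hello World.")
print(escaped)
print(unescape_whitespace(escaped) == "Hello World.")
```

Because the space is kept as a real symbol, the original text can be recovered exactly, which would not work if a normal underscore were used and the text already contained underscores.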