How to ensure that tokenizers never truncate in the middle of a word?

Hi Nasheed, I’m quite curious about your use case and why you’re interested in never partially truncating, if you don’t mind sharing!

In any case, here is how I would do it: increase max_length by 1 and tokenize the text. Decode the tokenized text and check whether the second-to-last token (the one before the final [SEP] token) starts with ## (the prefix that signifies a continuation piece of a longer word that was split). If it does, remove both tokens, the one that starts with ## and the one before it. If not, just remove the one before the [SEP]. (For a word split into more than two pieces, you would repeat the removal until the last remaining token no longer starts with ##.)

In your example it would be

[CLS] I am Nasheed and I like xylo ##phones [SEP]

Because the second-to-last token (##phones) starts with ##, you would remove it and the token before it (xylo).
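
Here is a minimal sketch of that procedure in Python, assuming a WordPiece tokenizer such as bert-base-uncased from the transformers library; the function name tokenize_whole_words is just for illustration. It uses a loop so that words split into more than two pieces are also removed cleanly:

```python
from transformers import AutoTokenizer

# Assumes a WordPiece tokenizer, where "##" marks continuation pieces
# of a split word; other tokenizers use different conventions.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_whole_words(text, max_length):
    """Tokenize `text`, but never end the sequence on a partial word."""
    # Tokenize with one extra slot so we can tell whether the cut
    # landed in the middle of a word.
    ids = tokenizer(text, truncation=True, max_length=max_length + 1)["input_ids"]
    if len(ids) <= max_length:
        return ids  # everything fits, nothing to remove

    tokens = tokenizer.convert_ids_to_tokens(ids)
    end = len(tokens) - 1        # index of the final [SEP]
    removed = tokens[end - 1]    # drop one token to get back to max_length
    end -= 1
    # If the removed token was a continuation piece ("##..."), the word
    # it belonged to is now partial: keep removing pieces until the
    # sequence ends on a whole word again.
    while removed.startswith("##") and end > 1:
        removed = tokens[end - 1]
        end -= 1
    return ids[:end] + [tokenizer.sep_token_id]
```

For the example above, both xylo and ##phones would be dropped if the whole word no longer fits, so the returned sequence always ends on a complete word.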

Hope that helps.

Cheers
Heiko
