How to ensure that tokenizers never truncate in the middle of a word?

Hi Nasheed, I’m quite curious about your use case and why you’re interested in never partially truncating, if you don’t mind sharing!

In any case, here is how I would do it: increase max_length by 1 and tokenize the text. Decode the tokenized text and check whether the second-to-last token (the one before the final [SEP] token) starts with ## (the prefix that signifies a continuation piece of a longer word that was split). If it does, remove both tokens, the one that starts with ## and the one before it. If not, just remove the one before the [SEP]. (For a word split into more than two pieces, you would repeat the removal until the last remaining token no longer starts with ##.)

In your example it would be

[CLS] I am Nasheed and I like xylo ##phones [SEP]

Because the second-to-last token (##phones) starts with ##, you would remove it and the token before it (xylo).
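
Here is a minimal sketch of that procedure in Python, assuming a WordPiece tokenizer such as bert-base-uncased from the transformers library; the function name tokenize_whole_words is just for illustration. It uses a loop so that words split into more than two pieces are also removed cleanly:

```python
from transformers import AutoTokenizer

# Assumes a WordPiece tokenizer, where "##" marks continuation pieces
# of a split word; other tokenizers use different conventions.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_whole_words(text, max_length):
    """Tokenize `text`, but never end the sequence on a partial word."""
    # Tokenize with one extra slot so we can tell whether the cut
    # landed in the middle of a word.
    ids = tokenizer(text, truncation=True, max_length=max_length + 1)["input_ids"]
    if len(ids) <= max_length:
        return ids  # everything fits, nothing to remove

    tokens = tokenizer.convert_ids_to_tokens(ids)
    end = len(tokens) - 1        # index of the final [SEP]
    removed = tokens[end - 1]    # drop one token to get back to max_length
    end -= 1
    # If the removed token was a continuation piece ("##..."), the word
    # it belonged to is now partial: keep removing pieces until the
    # sequence ends on a whole word again.
    while removed.startswith("##") and end > 1:
        removed = tokens[end - 1]
        end -= 1
    return ids[:end] + [tokenizer.sep_token_id]
```

For the example above, both xylo and ##phones would be dropped if the whole word no longer fits, so the returned sequence always ends on a complete word.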

Hope that helps.

Cheers
Heiko
