Is there a way to ensure
tokenizers never partially truncate a word, as illustrated below:
tokenizer.decode(tokenizer('I am Nasheed and I like xylophones.', truncation=True, max_length=12)['input_ids'])
The output is the above sentence truncated like: "[CLS] I am Nasheed and I like xylo [SEP]"
I want it to be truncated as: "[CLS] I am Nasheed and I like [SEP]"
Is there a way to enforce this?
Hi Nasheed, I’m quite curious about your use case and why you’re interested in never partially truncating, if you don’t mind sharing!
In any case, here is how I would do it:
1. Increase max_length by 1.
2. Tokenize the text.
3. Convert the token ids back to tokens (rather than decoding, since decoding merges the word pieces and hides the ## prefixes).
4. Check if the second-to-last token (the one before the final [SEP] token) starts with ## (the prefix that signifies that a longer word was split).
5. If yes, remove both tokens: the one that starts with ## and the one before it. If not, just remove the one token before the [SEP].
In your example, tokenizing with max_length increased by 1 would give:
[CLS] I am Nasheed and I like xylo ##phones [SEP]
Because the second-to-last token (##phones) starts with ##, you would remove that token and the token before it (xylo).
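If it is useful, here is a minimal sketch of those steps, assuming bert-base-uncased (the model choice and the function name are mine, and this only handles a last word split into two pieces):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def truncate_on_word_boundary(text, max_length):
    # Tokenize with one extra token of budget so we can see whether
    # the token at the cut-off point continues a split word.
    ids = tokenizer(text, truncation=True, max_length=max_length + 1)["input_ids"]
    if len(ids) <= max_length:
        return ids  # nothing was truncated

    # Inspect the tokens rather than the decoded string, so the
    # ## prefixes are still visible.
    tokens = tokenizer.convert_ids_to_tokens(ids)
    if tokens[-2].startswith("##"):
        # The last word was split: drop the ## piece and the piece before it.
        return ids[:-3] + ids[-1:]
    # No split: just drop the one surplus token before [SEP].
    return ids[:-2] + ids[-1:]

ids = truncate_on_word_boundary("I am Nasheed and I like xylophones.", 12)
print(tokenizer.decode(ids))
```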
Hope that helps.
Thanks for the quick response, @marshmellow77!
I am working on a paper that aims to extend Transformer models and architectures beyond the 512-token limit. A principal part of how I do this is splitting the original document/text into overlapping chunks.
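Splitting like this can be done with the tokenizer's built-in overflow handling, for example (the model name and the chunk/stride sizes here are arbitrary, not necessarily what I use in the paper):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_text = "some very long document " * 1000

# Break the document into overlapping windows: `max_length` caps each
# window, `stride` is how many tokens consecutive windows share, and
# `return_overflowing_tokens` returns every window instead of only the first.
encoding = tokenizer(
    long_text,
    truncation=True,
    max_length=512,
    stride=128,
    return_overflowing_tokens=True,
)
for chunk_ids in encoding["input_ids"]:
    print(len(chunk_ids))  # each chunk is at most 512 tokens
```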
For final words that are split into more than two tokens, I should recursively remove tokens from the end as long as they have the ## prefix, and after that remove one more token, which is the start of the word.
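In code, that generalisation might look something like this (a sketch built on top of the max_length + 1 trick above; the helper name is made up):

```python
def drop_partial_last_word(ids, tokenizer):
    # `ids` are assumed to come from tokenizing with max_length + 1,
    # as in the approach described above.
    tokens = tokenizer.convert_ids_to_tokens(ids)
    end = len(tokens) - 1                 # index of the final [SEP]
    if not tokens[end - 1].startswith("##"):
        return ids[:end - 1] + ids[end:]  # no split word: drop the surplus token
    while tokens[end - 1].startswith("##"):
        end -= 1                          # walk back over the ## pieces
    end -= 1                              # and over the start of the word
    return ids[:end] + ids[-1:]
```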
I am curious whether the approach you have described would also work with SentencePiece tokenizers; I will post an update here after experimenting.
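My guess is the check would need to be inverted there, since SentencePiece marks word starts rather than continuations; an untested sketch of what I mean:

```python
def is_continuation(token, special_tokens):
    # SentencePiece prefixes the *start* of a word with "▁" (U+2581),
    # so a piece continues the previous word when it lacks that marker
    # and is not a special token such as [CLS] or [SEP].
    return not token.startswith("▁") and token not in special_tokens
```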
Also, I have a follow-up question about controlling the stride overlap behavior of tokenizers along the lines of the original post. I will post a link to that discussion here as well.
Edit 1: This is the follow-up question.