Is there a way to ensure
tokenizers never partially truncate a word, as illustrated below:
tokenizer.decode(tokenizer('I am Nasheed and I like xylophones.', truncation=True, max_length=12)['input_ids'])
The output is the above sentence truncated like: "[CLS] I am Nasheed and I like xylo [SEP]"
I want it to be truncated as: "[CLS] I am Nasheed and I like [SEP]"
Is there a way to enforce this?
Hi Nasheed, I’m quite curious about your use case and why you’re interested in never partially truncating, if you don’t mind sharing!
In any case, here is how I would do it:
1. Increase max_length by 1.
2. Tokenize the text.
3. Convert the token ids back to tokens (rather than decoding, since decoding merges the word pieces and hides the ## prefixes).
4. Check if the second-to-last token (the one before the final [SEP] token) starts with ## (the prefix that signifies that a longer word was split).
5. If yes, remove both tokens: the one that starts with ## and the one before it. If not, just remove the one token before the [SEP].
In your example, tokenizing with max_length increased by 1 would give:
[CLS] I am Nasheed and I like xylo ##phones [SEP]
Because the second-to-last token (##phones) starts with ##, you would remove that token and the token before it (xylo).
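If it is useful, here is a minimal sketch of those steps, assuming bert-base-uncased (the model choice and the function name are mine, and this only handles a last word split into two pieces):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def truncate_on_word_boundary(text, max_length):
    # Tokenize with one extra token of budget so we can see whether
    # the token at the cut-off point continues a split word.
    ids = tokenizer(text, truncation=True, max_length=max_length + 1)["input_ids"]
    if len(ids) <= max_length:
        return ids  # nothing was truncated

    # Inspect the tokens rather than the decoded string, so the
    # ## prefixes are still visible.
    tokens = tokenizer.convert_ids_to_tokens(ids)
    if tokens[-2].startswith("##"):
        # The last word was split: drop the ## piece and the piece before it.
        return ids[:-3] + ids[-1:]
    # No split: just drop the one surplus token before [SEP].
    return ids[:-2] + ids[-1:]

ids = truncate_on_word_boundary("I am Nasheed and I like xylophones.", 12)
print(tokenizer.decode(ids))
```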
Hope that helps.
Thanks for the quick response, @marshmellow77!
I am working on a paper that aims to extend Transformer models and architectures beyond the 512-token limit. A principal part of how I do this is splitting the original document/text into overlapping chunks.
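Splitting like this can be done with the tokenizer's built-in overflow handling, for example (the model name and the chunk/stride sizes here are arbitrary, not necessarily what I use in the paper):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
long_text = "some very long document " * 1000

# Break the document into overlapping windows: `max_length` caps each
# window, `stride` is how many tokens consecutive windows share, and
# `return_overflowing_tokens` returns every window instead of only the first.
encoding = tokenizer(
    long_text,
    truncation=True,
    max_length=512,
    stride=128,
    return_overflowing_tokens=True,
)
for chunk_ids in encoding["input_ids"]:
    print(len(chunk_ids))  # each chunk is at most 512 tokens
```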
For final words that are split into more than two tokens, I should recursively remove tokens from the end as long as they have the ## prefix, and after that remove one more token, which is the start of the word.
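In code, that generalisation might look something like this (a sketch built on top of the max_length + 1 trick above; the helper name is made up):

```python
def drop_partial_last_word(ids, tokenizer):
    # `ids` are assumed to come from tokenizing with max_length + 1,
    # as in the approach described above.
    tokens = tokenizer.convert_ids_to_tokens(ids)
    end = len(tokens) - 1                 # index of the final [SEP]
    if not tokens[end - 1].startswith("##"):
        return ids[:end - 1] + ids[end:]  # no split word: drop the surplus token
    while tokens[end - 1].startswith("##"):
        end -= 1                          # walk back over the ## pieces
    end -= 1                              # and over the start of the word
    return ids[:end] + ids[-1:]
```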
I am curious whether the approach you have described would also work with SentencePiece tokenizers; I will post an update here after experimenting.
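My guess is the check would need to be inverted there, since SentencePiece marks word starts rather than continuations; an untested sketch of what I mean:

```python
def is_continuation(token, special_tokens):
    # SentencePiece prefixes the *start* of a word with "▁" (U+2581),
    # so a piece continues the previous word when it lacks that marker
    # and is not a special token such as [CLS] or [SEP].
    return not token.startswith("▁") and token not in special_tokens
```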
Also, I have a follow-up question about controlling the stride overlap behavior of tokenizers along the lines of the original post. I will post a link to that discussion here as well.
Edit 1: This is the follow-up question.