Thanks for the quick response @marshmellow77
I am working on a paper that aims to extend all Transformer models and architectures beyond the 512-token limit. A principal part of my approach is splitting the original document/text into overlapping chunks.
For trailing words longer than 3 tokens, I should recursively remove tokens from the end as long as they carry the ## continuation prefix, and after that remove one more token, which is the start of the word.
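A minimal sketch of that trimming step, assuming a WordPiece-style tokenizer (e.g. BERT) where continuation pieces are marked with the `##` prefix; the function name and the exact length threshold are my own illustration:

```python
def trim_trailing_word(tokens):
    """Drop a partial trailing word from a WordPiece token list.

    Counts trailing "##"-prefixed continuation pieces; if the full word
    (continuations plus the word-start piece) is longer than 3 tokens,
    the whole word is removed from the end.
    """
    pieces = 0
    # Count continuation pieces from the end of the sequence.
    while pieces < len(tokens) and tokens[-1 - pieces].startswith("##"):
        pieces += 1
    word_len = pieces + 1  # include the word-start token
    if pieces == 0 or word_len <= 3:
        return list(tokens)  # nothing to trim, or the word is short enough
    return tokens[:-word_len]

print(trim_trailing_word(["hello", "un", "##belie", "##va", "##ble"]))
# → ['hello']
```

Short words (3 tokens or fewer) are left intact, matching the length condition above.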
I am curious whether the approach you have described would also work with SentencePiece tokenizers. I will post an update here after experimenting.
Also, I have a follow-up question about controlling the stride overlap behavior of tokenizers along the lines of the original post. I will post a link to that discussion here as well.
Edit 1: This is the follow-up question.