Changing Tokenizer's max_length gets weird result

Mapraw · May 17, 2022, 4:33am

Hello, I am trying to tokenize sentences with “bert-base-uncased” and max_length set to 3. My input is [['I love it', 'You done'], ['Mary do', 'Dog eats paper']], and the tokenizer returns many sequences, some longer than the max_length I set. Could someone please explain this behavior?

cog · May 17, 2022, 6:45am

Hi.
If you call the tokenizer without ‘return_overflowing_tokens’, it returns correctly truncated tokens.

I also tried your code. With ‘return_overflowing_tokens=True’ set, it raises an error.
The Hugging Face documentation says:

return_overflowing_tokens ( bool , optional , defaults to False ) — Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch of pairs) is provided with truncation_strategy = longest_first or True , an error is raised instead of returning overflowing tokens.
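To see why max_length = 3 is such a tight limit for sentence pairs, here is a rough sketch of the arithmetic. It is a simplified model and not the transformers implementation: whitespace splitting stands in for WordPiece, and the helper names (content_budget, truncate_longest_first) are made up for illustration. It assumes BERT's usual layout of [CLS] A [SEP] B [SEP], which spends 3 positions on special tokens for a pair.

```python
# Simplified model of BERT-style pair truncation. NOT the library's code:
# whitespace splitting replaces WordPiece, and both helper names are hypothetical.

SPECIAL_TOKENS_SINGLE = 2  # [CLS] A [SEP]
SPECIAL_TOKENS_PAIR = 3    # [CLS] A [SEP] B [SEP]


def content_budget(max_length, is_pair):
    """Number of positions left for real tokens once special tokens are counted."""
    special = SPECIAL_TOKENS_PAIR if is_pair else SPECIAL_TOKENS_SINGLE
    return max(0, max_length - special)


def truncate_longest_first(tokens_a, tokens_b, max_length):
    """Drop one token at a time from the end of the longer sequence
    (the second sequence on ties) until the pair fits the budget."""
    a, b = list(tokens_a), list(tokens_b)
    budget = content_budget(max_length, is_pair=True)
    while len(a) + len(b) > budget:
        if len(a) > len(b):
            a.pop()
        else:
            b.pop()
    return ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]


first = "i love it".split()
second = "you done".split()

for limit in (3, 6, 8):
    print(limit, truncate_longest_first(first, second, limit))
```

With a limit of 3 the budget for real tokens is 0, so every word of both sentences is cut and would become overflow if you asked for it. With the real tokenizer, passing the two sentences as text and text_pair with truncation=True and a larger max_length is the usual way to get a sensible truncated pair.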
