Changing the tokenizer's max_length gives unexpected results

Hello, I am trying to tokenize the sentences ["I love it", "You done"], ["Mary do", "Dog eats paper"] with "bert-base-uncased" and max_length set to 3, but it returns a lot of sequences that are longer than the max_length I set. Could someone please explain this behavior?


Please.

Hi.
If you call the tokenizer without 'return_overflowing_tokens', it returns correctly truncated tokens.

I also tried your code. With 'return_overflowing_tokens=True' set, it raises an error.
The Hugging Face documentation says:

  • return_overflowing_tokens ( bool , optional , defaults to False ) — Whether or not to return overflowing token sequences. If a pair of sequences of input ids (or a batch of pairs) is provided with truncation_strategy = longest_first or True , an error is raised instead of returning overflowing tokens.
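To show why asking for overflowing tokens can return more rows than you passed in, here is a rough plain-Python sketch. It is not the library's implementation: the helper chunk_with_specials is hypothetical, a whitespace split stands in for WordPiece, and it assumes single sentences (not pairs) where [CLS] and [SEP] use up two of the max_length slots.

```python
def chunk_with_specials(tokens, max_length, cls="[CLS]", sep="[SEP]"):
    """Hypothetical helper: split `tokens` into windows that each fit in
    `max_length` once [CLS] and [SEP] are added around them."""
    room = max_length - 2  # two slots are taken by the special tokens
    if room < 1:
        raise ValueError("max_length must leave room for at least one token")
    return [[cls, *tokens[i:i + room], sep] for i in range(0, len(tokens), room)]


sentences = ["I love it", "You done"]
rows = []            # every chunk becomes its own output row
sample_mapping = []  # index of the input sentence each row came from

for idx, sentence in enumerate(sentences):
    tokens = sentence.lower().split()  # assumption: stand-in for WordPiece
    for chunk in chunk_with_specials(tokens, max_length=3):
        rows.append(chunk)
        sample_mapping.append(idx)

for source, row in zip(sample_mapping, rows):
    print(source, row)
```

In this sketch two input sentences turn into five rows, each exactly max_length long, because the tokens that do not fit are kept as extra rows instead of being dropped. Fast tokenizers in transformers report a similar mapping under the key overflow_to_sample_mapping; without overflow, only the first window per sentence would be kept.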