Hi HF community, there seems to be a problem with my tokenization for MLM.
TL;DR:
I am calling my tokenizer like this (note return_overflowing_tokens=True):
tokenizer.encode_plus(tokenized_sequence, return_token_type_ids=True, return_attention_mask=True, padding=True, return_overflowing_tokens=True)
on an instance created with BertTokenizer.from_pretrained.
But it raises:
TypeError Traceback (most recent call last)
/tmp/ipykernel_258362/4272170064.py in <module>
2 sentence_tokenids = [3, 233, 5029, 57, 3966, 9175, 30000, 26, 6812, 30001, 7, 961, 50, 21, 12508, 208, 25468, 30002, 5915, 26901, 159, 57, 4]
3
----> 4 test = tokenizer.encode_plus(sentence_tokens[1:-1], return_token_type_ids=True, return_attention_mask=True, padding=True, return_overflowing_tokens=True)
5
~/builds/anaconda3/envs/tokenizer3.7/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2626 return_length=return_length,
2627 verbose=verbose,
-> 2628 **kwargs,
2629 )
2630
~/builds/anaconda3/envs/tokenizer3.7/lib/python3.7/site-packages/transformers/tokenization_utils.py in _encode_plus(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
666 return_special_tokens_mask=return_special_tokens_mask,
667 return_length=return_length,
--> 668 verbose=verbose,
669 )
670
~/builds/anaconda3/envs/tokenizer3.7/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in prepare_for_model(self, ids, pair_ids, add_special_tokens, padding, truncation, max_length, stride, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, prepend_batch_axis, **kwargs)
3050 if return_overflowing_tokens:
3051 encoded_inputs["overflowing_tokens"] = overflowing_tokens
-> 3052 encoded_inputs["num_truncated_tokens"] = total_len - max_length
3053
3054 # Add special tokens
TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'
Why? How do I get the overflowing tokens?
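If I read the traceback correctly, the failure happens because max_length is None when prepare_for_model computes total_len - max_length, i.e. return_overflowing_tokens seems to require truncating to an explicit max_length. Assuming that reading is right, a call like this sketch (max_length=16 is just an illustrative value, not my real setting) should avoid the TypeError:

```python
sentence_tokenids = [3, 233, 5029, 57, 3966, 9175, 30000, 26, 6812, 30001, 7,
                     961, 50, 21, 12508, 208, 25468, 30002, 5915, 26901, 159, 57, 4]

# Guess: overflow can only be collected if the sequence is actually truncated,
# so truncation must be enabled and max_length set explicitly.
encoding = tokenizer.encode_plus(
    sentence_tokenids[1:-1],   # drop the leading/trailing special-token ids, as in my call above
    truncation=True,
    max_length=16,             # illustrative value only
    return_token_type_ids=True,
    return_attention_mask=True,
    padding=True,
    return_overflowing_tokens=True,
)
print(encoding["overflowing_tokens"])    # the ids cut off by the truncation
print(encoding["num_truncated_tokens"])  # total_len - max_length, per the source above
```

Is that the intended usage, or is there a way to get the overflow without forcing a max_length?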
The bigger picture with more context:
My code interleaves with the tokenization of a BertTokenizer.from_pretrained instance: the pre-trained tokenizer tokenizes everything except tokens with a certain part-of-speech, and those are tokenized by my own algorithm.
Together they produce a standard encoding, e.g.
{'input_ids': [3, 233, 5029, 57, 3966, 9175, 30000, 26, 6812, 30001, 7, 961, 50, 21, 12508, 208, 25468, 30002, 5915, 26901, 159, 57, 4], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
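To make the setup concrete, here is roughly how such an encoding can be assembled from a pre-merged id list with prepare_for_model (the method at the bottom of the traceback). merged_ids is a hypothetical stand-in for the combined output of the pre-trained tokenizer and my POS-specific algorithm:

```python
# merged_ids: hypothetical stand-in for my merged token ids (no special tokens yet).
merged_ids = [233, 5029, 57, 3966, 9175, 30000, 26, 6812, 30001, 7,
              961, 50, 21, 12508, 208, 25468, 30002, 5915, 26901, 159, 57]

encoding = tokenizer.prepare_for_model(
    merged_ids,
    add_special_tokens=True,     # wraps the sequence in the special-token ids
    return_token_type_ids=True,
    return_attention_mask=True,
)
# encoding now has the three parallel lists shown above:
# input_ids, token_type_ids, attention_mask
```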
Is it possible to generate the overflow_to_sample_mapping output for an encoding, as produced in tokenization_utils_fast? I can’t make sense of its behaviour from the source code, and the documentation did not explain it well enough for me to replicate the full encoding myself.
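For what it’s worth, my current understanding (possibly wrong) of overflow_to_sample_mapping is that it simply records, for each chunk the fast tokenizer emits, the index of the batch sample that chunk came from. A naive sketch of what I would have to replicate for pre-tokenized ids, with windowing and an optional stride:

```python
from typing import List, Tuple

def chunk_with_mapping(
    batch_ids: List[List[int]],
    max_length: int,
    stride: int = 0,
) -> Tuple[List[List[int]], List[int]]:
    """Naive recreation (my guess) of overflow_to_sample_mapping: split each
    sample into windows of at most max_length ids, overlapping by `stride`
    ids (stride must be < max_length), and record the index of the sample
    each window came from."""
    chunks: List[List[int]] = []
    mapping: List[int] = []
    step = max_length - stride
    for sample_idx, ids in enumerate(batch_ids):
        for start in range(0, max(len(ids) - stride, 1), step):
            chunks.append(ids[start:start + max_length])
            mapping.append(sample_idx)
    return chunks, mapping

# Two samples; with max_length=8 the first one overflows into a second chunk.
chunks, mapping = chunk_with_mapping(
    [[233, 5029, 57, 3966, 9175, 30000, 26, 6812, 30001, 7], [961, 50, 21]],
    max_length=8,
)
print(mapping)  # [0, 0, 1]: chunks 0 and 1 come from sample 0, chunk 2 from sample 1
```

If that is essentially all the fast tokenizer does (presumably also reserving room for the special tokens added to each chunk), I could replicate it myself; a confirmation either way would help.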
Any hints would be greatly appreciated!