Hi HF community, there seems to be a problem with my tokenization for MLM.
TL;DR:
I am calling my tokenizer like this (note return_overflowing_tokens=True):
tokenizer.encode_plus(tokenized_sequence, return_token_type_ids=True, return_attention_mask=True, padding=True, return_overflowing_tokens=True)
on an instance created with BertTokenizer.from_pretrained.
But it raises:
TypeError Traceback (most recent call last)
/tmp/ipykernel_258362/4272170064.py in <module>
2 sentence_tokenids = [3, 233, 5029, 57, 3966, 9175, 30000, 26, 6812, 30001, 7, 961, 50, 21, 12508, 208, 25468, 30002, 5915, 26901, 159, 57, 4]
3
----> 4 test = tokenizer.encode_plus(sentence_tokens[1:-1], return_token_type_ids=True, return_attention_mask=True, padding=True, return_overflowing_tokens=True)
5
~/builds/anaconda3/envs/tokenizer3.7/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in encode_plus(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2626 return_length=return_length,
2627 verbose=verbose,
-> 2628 **kwargs,
2629 )
2630
~/builds/anaconda3/envs/tokenizer3.7/lib/python3.7/site-packages/transformers/tokenization_utils.py in _encode_plus(self, text, text_pair, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
666 return_special_tokens_mask=return_special_tokens_mask,
667 return_length=return_length,
--> 668 verbose=verbose,
669 )
670
~/builds/anaconda3/envs/tokenizer3.7/lib/python3.7/site-packages/transformers/tokenization_utils_base.py in prepare_for_model(self, ids, pair_ids, add_special_tokens, padding, truncation, max_length, stride, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, prepend_batch_axis, **kwargs)
3050 if return_overflowing_tokens:
3051 encoded_inputs["overflowing_tokens"] = overflowing_tokens
-> 3052 encoded_inputs["num_truncated_tokens"] = total_len - max_length
3053
3054 # Add special tokens
TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'
Why? How do I get the overflowing tokens?
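If I read the traceback correctly, the failure happens because max_length is None when prepare_for_model computes total_len - max_length, i.e. return_overflowing_tokens seems to require truncating to an explicit max_length. Assuming that reading is right, a call like this sketch (max_length=16 is just an illustrative value, not my real setting) should avoid the TypeError:

```python
sentence_tokenids = [3, 233, 5029, 57, 3966, 9175, 30000, 26, 6812, 30001, 7,
                     961, 50, 21, 12508, 208, 25468, 30002, 5915, 26901, 159, 57, 4]

# Guess: overflow can only be collected if the sequence is actually truncated,
# so truncation must be enabled and max_length set explicitly.
encoding = tokenizer.encode_plus(
    sentence_tokenids[1:-1],   # drop the leading/trailing special-token ids, as in my call above
    truncation=True,
    max_length=16,             # illustrative value only
    return_token_type_ids=True,
    return_attention_mask=True,
    padding=True,
    return_overflowing_tokens=True,
)
print(encoding["overflowing_tokens"])    # the ids cut off by the truncation
print(encoding["num_truncated_tokens"])  # total_len - max_length, per the source above
```

Is that the intended usage, or is there a way to get the overflow without forcing a max_length?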
The bigger picture with more context:
My code interleaves with the tokenization of a BertTokenizer.from_pretrained instance: the pre-trained tokenizer tokenizes everything except tokens with a certain part-of-speech, and those are tokenized by my own algorithm.
Together they produce a standard encoding, e.g.
{'input_ids': [3, 233, 5029, 57, 3966, 9175, 30000, 26, 6812, 30001, 7, 961, 50, 21, 12508, 208, 25468, 30002, 5915, 26901, 159, 57, 4], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
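To make the setup concrete, here is roughly how such an encoding can be assembled from a pre-merged id list with prepare_for_model (the method at the bottom of the traceback). merged_ids is a hypothetical stand-in for the combined output of the pre-trained tokenizer and my POS-specific algorithm:

```python
# merged_ids: hypothetical stand-in for my merged token ids (no special tokens yet).
merged_ids = [233, 5029, 57, 3966, 9175, 30000, 26, 6812, 30001, 7,
              961, 50, 21, 12508, 208, 25468, 30002, 5915, 26901, 159, 57]

encoding = tokenizer.prepare_for_model(
    merged_ids,
    add_special_tokens=True,     # wraps the sequence in the special-token ids
    return_token_type_ids=True,
    return_attention_mask=True,
)
# encoding now has the three parallel lists shown above:
# input_ids, token_type_ids, attention_mask
```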
Is it possible to generate the overflow_to_sample_mapping output for an encoding, as produced in tokenization_utils_fast? I can’t make sense of its behaviour from the source code, and the documentation did not explain it well enough for me to replicate the full encoding myself.
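For what it’s worth, my current understanding (possibly wrong) of overflow_to_sample_mapping is that it simply records, for each chunk the fast tokenizer emits, the index of the batch sample that chunk came from. A naive sketch of what I would have to replicate for pre-tokenized ids, with windowing and an optional stride:

```python
from typing import List, Tuple

def chunk_with_mapping(
    batch_ids: List[List[int]],
    max_length: int,
    stride: int = 0,
) -> Tuple[List[List[int]], List[int]]:
    """Naive recreation (my guess) of overflow_to_sample_mapping: split each
    sample into windows of at most max_length ids, overlapping by `stride`
    ids (stride must be < max_length), and record the index of the sample
    each window came from."""
    chunks: List[List[int]] = []
    mapping: List[int] = []
    step = max_length - stride
    for sample_idx, ids in enumerate(batch_ids):
        for start in range(0, max(len(ids) - stride, 1), step):
            chunks.append(ids[start:start + max_length])
            mapping.append(sample_idx)
    return chunks, mapping

# Two samples; with max_length=8 the first one overflows into a second chunk.
chunks, mapping = chunk_with_mapping(
    [[233, 5029, 57, 3966, 9175, 30000, 26, 6812, 30001, 7], [961, 50, 21]],
    max_length=8,
)
print(mapping)  # [0, 0, 1]: chunks 0 and 1 come from sample 0, chunk 2 from sample 1
```

If that is essentially all the fast tokenizer does (presumably also reserving room for the special tokens added to each chunk), I could replicate it myself; a confirmation either way would help.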
Any hints would be greatly appreciated!