@echatzikyriakidis I ran into a similar issue and found that the tokenizer class provides most of what you need, if striding over the document and summarizing each chunk is sufficient, e.g.:
```python
def do_strided_tokenization(the_content,
                            tokenizer,
                            number_token_strides=2,
                            maximum_number_of_tokens=None):
    if not maximum_number_of_tokens:
        maximum_number_of_tokens = tokenizer.model_max_length

    # Encode the document once; return_overflowing_tokens splits it into
    # overlapping chunks of at most max_length tokens, with each chunk
    # sharing `stride` tokens with the previous one.
    the_input_ids = tokenizer.batch_encode_plus(
        [the_content],
        return_overflowing_tokens=True,
        truncation=True,
        max_length=maximum_number_of_tokens,
        stride=number_token_strides
    )['input_ids']

    # Decode each chunk back to text so it can be summarized independently.
    strided_tokenized_content = tokenizer.batch_decode(the_input_ids)

    return strided_tokenized_content
```
```python
test_string = 'The red fox jumped over the blue, rainbow.'

print(
    do_strided_tokenization(test_string,
                            summarizer.summarizer.tokenizer,
                            number_token_strides=2,
                            maximum_number_of_tokens=5)
)
print(summarizer.summarizer.tokenizer.model_max_length)
```
```
['The red fox jumped</s>', 'fox jumped over the</s>', 'over the blue,</s>', 'blue, rainbow.</s>']
512
```
The end-of-string token (`</s>`) can be skipped too (e.g. by passing `skip_special_tokens=True` to `batch_decode`), or the encoded input IDs can even be passed directly to the model instead of round-tripping through `batch_decode`. hth.
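For reference, a minimal sketch of that second option, assuming a matching seq2seq pair (the `tokenizer`/`model` names here are hypothetical stand-ins for whatever `summarizer` wraps, e.g. from `AutoTokenizer`/`AutoModelForSeq2SeqLM`):

```python
import torch

# Hypothetical sketch: feed the strided chunks straight to the model,
# skipping the intermediate batch_decode of the chunks themselves.
encoded = tokenizer.batch_encode_plus(
    [test_string],
    return_overflowing_tokens=True,
    truncation=True,
    max_length=5,
    stride=2,
    padding=True,          # chunks can differ in length, so pad to a batch
    return_tensors='pt'
)

with torch.no_grad():
    summary_ids = model.generate(encoded['input_ids'],
                                 attention_mask=encoded['attention_mask'])

# skip_special_tokens=True drops </s> and friends from the decoded text
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))
```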