Summarization on long documents

@echatzikyriakidis I ran into a similar issue and found that the tokenizer class provides most of what you need, if striding across the document and summarizing each window is sufficient, e.g.:

def do_strided_tokenization(the_content,
                            tokenizer,
                            number_token_strides=2,
                            maximum_number_of_tokens=None):
    if maximum_number_of_tokens is None:
        maximum_number_of_tokens = tokenizer.model_max_length

    # Split the text into overlapping, model-sized windows of token ids;
    # `stride` is the number of tokens shared between consecutive windows.
    the_input_ids = tokenizer.batch_encode_plus(
        [the_content],
        return_overflowing_tokens=True,
        truncation=True,
        max_length=maximum_number_of_tokens,
        stride=number_token_strides,
    )['input_ids']

    # Decode each window back to text so it can be fed to the summarizer.
    return tokenizer.batch_decode(the_input_ids)

test_string = 'The red fox jumped over the blue, rainbow.'

print(
    do_strided_tokenization(test_string,
                            summarizer.summarizer.tokenizer,
                            number_token_strides=2,
                            maximum_number_of_tokens=5)
)
print(summarizer.summarizer.tokenizer.model_max_length)
>> ['The red fox jumped</s>', 'fox jumped over the</s>', 'over the blue,</s>', 
>> 'blue, rainbow.</s>']
>> 512
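For clarity, the windowing behind the output above can be mimicked on a plain list of tokens. This is a minimal sketch, not the library's implementation; `make_strided_windows` is a hypothetical helper, and it assumes the stride is smaller than the window length:

```python
def make_strided_windows(tokens, max_length, stride):
    """Split `tokens` into windows of at most `max_length` items,
    each overlapping the previous window by `stride` items.
    Assumes stride < max_length."""
    windows = []
    step = max_length - stride
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break
    return windows

tokens = 'The red fox jumped over the blue , rainbow .'.split()
for window in make_strided_windows(tokens, max_length=4, stride=2):
    print(' '.join(window))
# The red fox jumped
# fox jumped over the
# over the blue ,
# blue , rainbow .
```

Each window starts `max_length - stride` tokens after the previous one, so the last `stride` tokens of one window reappear at the start of the next, which is what gives each summary window some context from its neighbor.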

The end-of-string (</s>) token can be skipped too (e.g. by passing skip_special_tokens=True to batch_decode), or the output of batch_encode_plus can be passed directly to the model instead of round-tripping through batch_decode. hth.
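To summarize the whole document, each decoded window can then be summarized independently and the partial summaries joined. A minimal sketch, with a toy callable standing in for a real model call (in practice something like `lambda text: summarizer(text)[0]['summary_text']`); `summarize_strided` is a hypothetical helper, not part of transformers:

```python
def summarize_strided(chunks, summarize_one):
    """Summarize each strided chunk independently, then join the
    partial summaries into one running summary of the document."""
    partial_summaries = [summarize_one(chunk) for chunk in chunks]
    return ' '.join(partial_summaries)

# Toy stand-in summarizer: keep the first three words of each chunk.
first_three_words = lambda text: ' '.join(text.split()[:3])

chunks = ['The red fox jumped', 'fox jumped over the']
print(summarize_strided(chunks, first_three_words))
# The red fox fox jumped over
```

Because the windows overlap, adjacent partial summaries may repeat content, so some post-hoc deduplication (or a second summarization pass over the joined text) may be worthwhile.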