@echatzikyriakidis I ran into a similar issue and found that the tokenizer class provides most of what you need, if striding over the document and summarizing each chunk is sufficient, e.g.:
```python
def do_strided_tokenization(the_content,
                            tokenizer,
                            number_token_strides=2,
                            maximum_number_of_tokens=None):
    if not maximum_number_of_tokens:
        maximum_number_of_tokens = tokenizer.model_max_length

    # Encode the document once; return_overflowing_tokens splits it into
    # overlapping chunks of at most max_length tokens, with each chunk
    # sharing `stride` tokens with the previous one.
    the_input_ids = tokenizer.batch_encode_plus(
        [the_content],
        return_overflowing_tokens=True,
        truncation=True,
        max_length=maximum_number_of_tokens,
        stride=number_token_strides
    )['input_ids']

    # Decode each chunk back to text so it can be summarized independently.
    strided_tokenized_content = tokenizer.batch_decode(the_input_ids)

    return strided_tokenized_content
```
```python
test_string = 'The red fox jumped over the blue, rainbow.'

print(
    do_strided_tokenization(test_string,
                            summarizer.summarizer.tokenizer,
                            number_token_strides=2,
                            maximum_number_of_tokens=5)
)
print(summarizer.summarizer.tokenizer.model_max_length)
```
```
['The red fox jumped</s>', 'fox jumped over the</s>', 'over the blue,</s>', 'blue, rainbow.</s>']
512
```
The end-of-string token (`</s>`) can be skipped too (e.g. by passing `skip_special_tokens=True` to `batch_decode`), or the encoded input IDs can even be passed directly to the model instead of round-tripping through `batch_decode`. hth.
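For reference, a minimal sketch of that second option, assuming a matching seq2seq pair (the `tokenizer`/`model` names here are hypothetical stand-ins for whatever `summarizer` wraps, e.g. from `AutoTokenizer`/`AutoModelForSeq2SeqLM`):

```python
import torch

# Hypothetical sketch: feed the strided chunks straight to the model,
# skipping the intermediate batch_decode of the chunks themselves.
encoded = tokenizer.batch_encode_plus(
    [test_string],
    return_overflowing_tokens=True,
    truncation=True,
    max_length=5,
    stride=2,
    padding=True,          # chunks can differ in length, so pad to a batch
    return_tensors='pt'
)

with torch.no_grad():
    summary_ids = model.generate(encoded['input_ids'],
                                 attention_mask=encoded['attention_mask'])

# skip_special_tokens=True drops </s> and friends from the decoded text
print(tokenizer.batch_decode(summary_ids, skip_special_tokens=True))
```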