Summarization on long documents

marcoabrate · December 2, 2020, 9:49am

The regex \.(?!\d)|\n works in Python: it splits on a full stop that is not followed by a digit (to avoid splitting floating-point numbers) or on a newline. Consider changing it to whatever suits your text better. For example, I do not have any URLs in my text; if I did, the full stops inside them would cause unwanted splits.
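To make the splitting rule concrete, here is a minimal sketch. It assumes the full stop in the pattern is escaped as \. (which the description implies), and split_sentences is just a hypothetical helper name, not something from the original code.

```python
import re

# Split on a full stop not followed by a digit, or on a newline.
# Assumption: the dot is escaped; an unescaped "." would match any character.
SPLIT_PATTERN = re.compile(r"\.(?!\d)|\n")


def split_sentences(text):
    """Split text into sentence-like pieces, dropping empty fragments."""
    parts = SPLIT_PATTERN.split(text)
    return [part.strip() for part in parts if part.strip()]


example = "Pi is 3.14. It is a constant.\nNew line here"
print(split_sentences(example))

# The URL caveat: the full stop inside a domain name also triggers a split.
print(split_sentences("Visit huggingface.co now"))
```

Note that re.split drops the delimiter, so the full stops themselves do not appear in the resulting pieces.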

num_tok is the number of tokens in the entire text (the variable text).

The tokenizer is either the Bart or the Pegasus one; it works for both. I use the tokenize function so that I do not get BOS and EOS tokens for each sentence.
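A rough sketch of how the counts could be computed, under some assumptions: token_counts is a hypothetical helper, and a whitespace tokenizer stands in for the real one so the snippet runs without downloading a model. Any callable that maps a string to a list of tokens fits in its place.

```python
def token_counts(text, sentences, tokenize):
    """Return (num_tok, per_sentence) using the given tokenize callable.

    num_tok is the token count of the entire text; per_sentence holds the
    token count of each sentence, with no BOS/EOS added by this helper.
    """
    num_tok = len(tokenize(text))
    per_sentence = [len(tokenize(sentence)) for sentence in sentences]
    return num_tok, per_sentence


# Stand-in tokenizer (assumption): plain whitespace splitting.
whitespace_tokenize = str.split

text = "Pi is 3.14. It is a constant."
sentences = ["Pi is 3.14", "It is a constant"]
num_tok, per_sentence = token_counts(text, sentences, whitespace_tokenize)
print(num_tok, per_sentence)
```

With the transformers library installed, one would presumably pass something like BartTokenizer.from_pretrained("facebook/bart-large-cnn").tokenize as the tokenize argument (the checkpoint name is only an example). With a subword tokenizer, the per-sentence counts need not add up exactly to num_tok.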
