Summarization on long documents

Hey @MoritzLaurer @echatzikyriakidis

The regex \.(?!\d)|\n works in Python. It splits wherever there is a full stop (unless a digit follows it, to avoid splitting floating-point numbers) or a new line. Consider changing it to whatever suits your text better. For example, I do not have any URLs in my text; otherwise the dots inside them would cause unwanted splits.
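As a rough illustration of how that pattern behaves, here is a minimal sketch using Python's re module. The helper name split_sentences and the sample string are made up for this example and are not taken from the original code.

```python
import re

# Split at a full stop that is not followed by a digit, or at a newline.
SENTENCE_SPLIT = re.compile(r"\.(?!\d)|\n")

def split_sentences(text):
    # Hypothetical helper: strip whitespace and drop the empty pieces
    # that appear when a full stop is directly followed by a newline.
    return [s.strip() for s in SENTENCE_SPLIT.split(text) if s.strip()]

sample = "Pi is about 3.14. It is irrational.\nSecond line"
print(split_sentences(sample))
# ['Pi is about 3.14', 'It is irrational', 'Second line']
```

Note that re.split drops the full stops themselves, and that a string such as www.example.com would be cut at each dot, which is the URL issue mentioned above.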

num_tok is the number of tokens in the entire text (the variable text).

The tokenizer is either the Bart or the Pegasus one; it works for both. I use the tokenize function so that I do not get BOS and EOS tokens for each sentence.
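To make the counting step concrete, here is a small sketch. It takes any callable that turns a string into a list of tokens, so it runs without downloading a model. The function name token_counts and the whitespace stand-in tokenizer are assumptions for illustration only; real counts from Bart or Pegasus will differ.

```python
def token_counts(text, sentences, tokenize):
    # num_tok: number of tokens in the whole text.
    num_tok = len(tokenize(text))
    # Token count of each sentence, tokenized on its own.
    per_sentence = [len(tokenize(s)) for s in sentences]
    return num_tok, per_sentence

# Stand-in tokenizer (an assumption): plain whitespace splitting.
whitespace_tokenize = str.split

text = "First sentence here. Second one"
sentences = ["First sentence here", "Second one"]
num_tok, per_sentence = token_counts(text, sentences, whitespace_tokenize)
print(num_tok, per_sentence)
# 5 [3, 2]
```

With the transformers library installed, the tokenize method of a tokenizer loaded via AutoTokenizer.from_pretrained (for example with the facebook/bart-large-cnn checkpoint) could be passed in place of whitespace_tokenize. That method returns the token strings without adding BOS and EOS, which is why it is convenient for per-sentence counts.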
