Hey @MoritzLaurer @echatzikyriakidis
The regex `\.(?!\d)|\n` works for Python. It splits on a full stop (unless the full stop is followed by a digit, to avoid splitting floating-point numbers) or on a newline. Consider changing it to whatever suits your text better. For example, I do not have any URLs in my text; otherwise it would be a problem.
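A minimal sketch of how that split behaves in Python. The helper name `split_sentences` and the sample text are made up for illustration, and the pattern is written with the full stop escaped (`\.`) so that it matches a literal dot rather than any character.

```python
import re

# Split on a full stop that is not followed by a digit, or on a newline.
SPLIT_PATTERN = re.compile(r"\.(?!\d)|\n")


def split_sentences(text):
    """Hypothetical helper: split text into sentences, dropping empty pieces."""
    parts = SPLIT_PATTERN.split(text)
    # re.split removes the delimiters, so the full stops are not kept.
    return [part.strip() for part in parts if part.strip()]


sample = "The loss fell to 0.25 after tuning. Results improved\nSee v2.0 for details."
sentences = split_sentences(sample)
print(sentences)

# A dot inside a domain name is followed by a letter, so it is split too.
url_pieces = split_sentences("see example.com")
print(url_pieces)
```

The decimal point in `0.25` and the dot in `v2.0` are left alone because a digit follows them, while the dot in `example.com` is treated as a sentence boundary, which is the URL problem mentioned above.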
`num_tok` is the number of tokens in the entire text, `text`.
The `tokenizer` is either Bart or Pegasus; it works for both. I use the `tokenize` function so that I do not get BOS and EOS tokens for each sentence.
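A rough sketch of counting tokens per sentence through `tokenize`. To keep it runnable without downloading a model, it uses a hypothetical `WhitespaceTokenizer` stand-in that only mimics the `tokenize` method; the helper `count_tokens`, the sample sentences, and the checkpoint name in the comment are assumptions for illustration, not something from my actual setup.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in that exposes tokenize() like a Hugging Face tokenizer."""

    def tokenize(self, text):
        # A real Bart or Pegasus tokenizer would return subword pieces here.
        return text.split()


def count_tokens(tokenizer, text):
    """Count tokens via tokenize(), which returns only the pieces of the text."""
    return len(tokenizer.tokenize(text))


tokenizer = WhitespaceTokenizer()
# With transformers installed, a real tokenizer could be swapped in, e.g.:
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")  # assumed checkpoint

sentences = ["The loss fell to 0.25", "Results improved"]
per_sentence = [count_tokens(tokenizer, s) for s in sentences]
print(per_sentence)
```

With a real tokenizer, calling `tokenizer(sentence)` instead would add the special tokens to every sentence, so the per-sentence counts would come out higher than the pieces of text alone.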