Nice @Kwame. What your implementation produces is actually overlapping chunks, and I don't think it is OK to cut a sentence in half. My implementation cuts the text into chunks so they can be summarized by a model, but it never splits a sentence into two parts. If a sentence cannot be added to the current chunk, it is transferred to the next chunk.
you can use this approach for your abstractive summarization
OpenAI just published a pretty amazing piece of work showing how they combined reinforcement learning with human feedback to summarise entire books.
Unfortunately, the models don't appear to be open sourced, but the general technique is very cool and fun to read.
h/t @lvwerra for telling me about this paper
To be more specific, we can use
tokenizer.max_len_single_sentence
as the rest of the space is taken up by the special tokens expected by the model. If I am not wrong,
tokenizer.model_max_length = tokenizer.max_len_single_sentence + tokenizer.num_special_tokens_to_add()
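A quick sanity check of that relationship on a concrete tokenizer (here the google/pegasus-xsum checkpoint that comes up later in this thread; any checkpoint exposing these attributes should behave the same way):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/pegasus-xsum")

print(tokenizer.model_max_length)             # total input budget, e.g. 512
print(tokenizer.num_special_tokens_to_add())  # slots reserved for special tokens, e.g. 1
print(tokenizer.max_len_single_sentence)      # what is left for actual content, e.g. 511

# the relationship described above
assert tokenizer.model_max_length == \
    tokenizer.max_len_single_sentence + tokenizer.num_special_tokens_to_add()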
How exactly are you delimiting the sentence boundaries?
@MoritzLaurer's code is really cool, but it will most likely strip off the special tokens and only pass the input_ids to the model. I thought: why not pass the model whatever it originally saw during its training? So I have re-written the logic below and also taken care of not breaking sentences into parts, with the help of nltk.
. @echatzikyriakidis
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import nltk

nltk.download('punkt')

checkpoint = "google/pegasus-xsum"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

long_text = "This is a very very long text. " * 300
sentences = nltk.tokenize.sent_tokenize(long_text)

# initialize
length = 0
chunk = ""
chunks = []
count = -1

for sentence in sentences:
    count += 1
    combined_length = len(tokenizer.tokenize(sentence)) + length  # add the no. of sentence tokens to the length counter

    if combined_length <= tokenizer.max_len_single_sentence:  # if it doesn't exceed
        chunk += sentence + " "  # add the sentence to the chunk
        length = combined_length  # update the length counter

        # if it is the last sentence
        if count == len(sentences) - 1:
            chunks.append(chunk)  # save the chunk
    else:
        chunks.append(chunk)  # save the chunk

        # reset
        length = 0
        chunk = ""

        # take care of the overflow sentence
        chunk += sentence + " "
        length = len(tokenizer.tokenize(sentence))

        # if the overflow sentence happens to be the last one, save it as its own chunk
        if count == len(sentences) - 1:
            chunks.append(chunk)

# inputs
inputs = [tokenizer(chunk, return_tensors="pt") for chunk in chunks]

# print summary
for input in inputs:
    output = model.generate(**input)
    print(tokenizer.decode(*output, skip_special_tokens=True))
and the output is:
This is a very very long text.
This is a very very long text.
This is a very very long text.
This is a very very long text.
This is a very very long text.
Some intermediate results:
[len(tokenizer(c.strip()).input_ids) for c in chunks]
gives:
[505, 505, 505, 505, 385]
which are well within the tokenizer.model_max_length of 512.
Do let me know if anything seems weird.
@saprativa your code is almost the same as what I tried to write myself, thanks so much. There's just one last bit I'm missing.
I want to merge several articles on the same topic, so the input text can be very long. I see your code splits all the text into sentences to fit the model, but the result I got contains several similar sentences.
So I'm looking at how to sort all sentences by similarity before making a summary. I came across this video, Sentence Similarity using HuggingFace's Sentence Transformers v2 - YouTube, but decided to ask here anyway for your opinion.
Is this the only way to use cosine_similarity for this (I guess I need to compare each sentence with the remaining ones and sort the list based on that)? Ideally I would do this on a paragraph basis, but the original articles might not have the same paragraphs, so for simplicity I'm trying to do this just for a list of sentences.
I was also looking at the model sentence-transformers/paraphrase-xlm-r-multilingual-v1 · Hugging Face, but being a newbie in this topic, I'm still struggling with how to combine all those pieces to achieve my goal.
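For concreteness, here is a rough sketch of the direction I am considering, using the paraphrase-xlm-r-multilingual-v1 checkpoint mentioned above: rather than sorting, embed all sentences and drop near-duplicates before summarizing (the `sentences` list and the 0.9 threshold are just placeholders):

from sentence_transformers import SentenceTransformer, util

# placeholder: `sentences` is the list of sentences gathered from all articles on the topic
model = SentenceTransformer("sentence-transformers/paraphrase-xlm-r-multilingual-v1")
embeddings = model.encode(sentences, convert_to_tensor=True)

# pairwise cosine similarities between every pair of sentences
cosine_scores = util.cos_sim(embeddings, embeddings)

# keep a sentence only if it is not too similar to one already kept (0.9 is an arbitrary threshold)
kept = []
for i in range(len(sentences)):
    if all(cosine_scores[i][j] < 0.9 for j in kept):
        kept.append(i)

deduplicated_sentences = [sentences[i] for i in kept]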
Quick question: for the pretrained model google/bigbird-pegasus-large-bigpatent, do we know how they managed to do the training?
Patents tend to be way longer than 4096 tokens…
Are you sure? According to the paper, BigPatent has an average input length similar to PubMed.
We do have an example notebook of evaluating BigBird on PubMed summarization here.
BigPatent actually just took the patent description and used it to generate the patent abstract, so it does not take the whole patent as input (that would be much longer, as you suggested).
Hi all, I found this thread very relevant to an exercise I'm undertaking, so I thought I'd share my thought process and get your views on it. I'm looking to train a model on the BookSum dataset (the same dataset OpenAI used in this paper) to be able to summarise book chapters.
Instead of using the summaries-of-summaries approach, I was looking to use models converted to a Longformer format to summarise entire chapters in one go. My thinking was to undertake the following experiments:
- Convert t5-3b to a longformer encoder decoder format and finetune on BookSum
- Fine-tune Pegasus_with_Longformer_summarization on BookSum
- Port model from the Efficient Attentions for Long Document Summarization Paper to HF and fine-tune on BookSum
What do you all think?
Hi,
Maybe you can try this model; there is a link to a conversion script at the end of the README for converting BART models to handle longer sequences.
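To give a feel for what a long-input encoder-decoder looks like at inference time (this is not the conversion script itself), here is a minimal sketch using the publicly available allenai/led-base-16384 LED checkpoint as a stand-in; long_chapter_text is a placeholder, and the base checkpoint would still need fine-tuning (e.g. on BookSum) to produce useful summaries:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

tokenizer = AutoTokenizer.from_pretrained("allenai/led-base-16384")
model = AutoModelForSeq2SeqLM.from_pretrained("allenai/led-base-16384")

long_chapter_text = "..."  # placeholder for an entire book chapter
inputs = tokenizer(long_chapter_text, return_tensors="pt", truncation=True, max_length=16384)

# LED combines local attention with a global attention mask;
# putting global attention on the first token is the usual choice for summarization
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    max_length=256,
    num_beams=4,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))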
Hi there, I've recently published a survey paper on abstractive text summarization for both short and long documents. To the best of my knowledge, the top-performing models for long document summarization on the three popular datasets (arXiv, PubMed, BigPatent) are LSH and BIGBIRD-Pegasus.
The ROUGE scores of the best models in this field, as reported in the paper, are shown in the image below.
Good luck with your projects!
Can you write the full code?
- nltk missing
- bart_tokenizer missing
- bart_model missing
This is not working when I try to run it like below:
import logging
from transformers import pipeline

# summarizer = pipeline("summarization", model="csebuetnlp/mT5_multilingual_XLSum")

f = open("TextFile1.txt", "r")
ARTICLE = f.read()
# print(summarizer(ARTICLE, max_length=900, do_sample=False))

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def do_strided_tokenization(the_content,
                            tokenizer,
                            number_token_strides=2,
                            maximum_number_of_tokens=None):
    if not maximum_number_of_tokens:
        maximum_number_of_tokens = tokenizer.model_max_length

    the_input_ids = \
        tokenizer.batch_encode_plus(
            [the_content],
            return_overflowing_tokens=True,
            truncation=True,
            max_length=maximum_number_of_tokens,
            stride=number_token_strides
        )['input_ids']

    strided_tokenized_content = \
        tokenizer.batch_decode(the_input_ids)

    return strided_tokenized_content

test_string = 'The red fox jumped over the blue, rainbow.'

print(
    do_strided_tokenization(test_string,
                            summarizer.summarizer.tokenizer,
                            number_token_strides=2,
                            maximum_number_of_tokens=5)
)
print(summarizer.summarizer.tokenizer.model_max_length)
Can you post a full working example, with input, output, and imports?
@MoritzLaurer I really like your idea of chopping up the tokenizer's ['input_ids'][0] into chunks and unsqueezing each of them. When I tried it myself, however, I ran into a huge issue:
"DefaultCPUAllocator: not enough memory"
I googled how PyTorch can throw this error, and it seems to happen because a tensor is so big that no reasonable amount of RAM can handle it. Would it be correct to narrow the source of the problem down to torch.unsqueeze()? How can I chop up, or even pickle, the individual tensors even more, just so my computer can run the code without dying?
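For reference, a minimal sketch of the chunk-and-unsqueeze approach I am describing (tokenizer, model and long_text are placeholders for the actual objects), generating one chunk at a time:

import torch

# placeholders: `tokenizer`, `model` and `long_text` stand in for the real objects
input_ids = tokenizer(long_text, return_tensors="pt").input_ids[0]

# split the flat id sequence into model-sized chunks and summarize them one by one
for chunk in torch.split(input_ids, tokenizer.max_len_single_sentence):
    with torch.no_grad():  # inference only, no gradients needed
        summary_ids = model.generate(chunk.unsqueeze(0), max_length=128)
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))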
P.S: Anyone is welcome to take a stab at educating me
Thanks!
A slight modification to dipanjanS's solution (by the way, thank you so much for sharing it).
For my use case I didn't want to use a tokenizer at all, so this is a workaround.
import re

def create_nested_sentences(document, token_max_length):
    nested = []
    sent = []
    length = 0
    for sentence in re.split(r'(?<=[^A-Z].[.?]) +(?=[A-Z])', document.replace("\n", ' ')):
        length += sentence.count(" ")  # rough estimate of full-word tokens
        if length < (token_max_length * 0.9):  # add a buffer, since these word counts won't match the LM's tokens exactly
            sent.append(sentence)
        else:
            nested.append(sent)
            sent = [sentence]
            length = sentence.count(" ")  # restart the count with the overflow sentence
    if sent:
        nested.append(sent)
    return nested
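A hypothetical usage example (the document string and token_max_length value are made up); joining each nested group back into one string gives the chunks to feed a summarizer:

document = "First sentence goes here. Second sentence follows it. " * 50
nested = create_nested_sentences(document, token_max_length=100)
chunks = [" ".join(group) for group in nested]
print(len(chunks), [len(c.split()) for c in chunks])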
Here's a version using a specific transformer: models.py · pleonova/multi-label-summary-text at main
Thank you very much for enriching this post with your thorough review!
Hi,
I have a large medical record document (an attending medical statement) which has a lot of tables. I need to create a summary document for this large PDF. Any idea which model would be best for this and what approach I should follow?