I want to summarize the T&Cs and privacy policies of various services. I’ve decided on a hybrid approach: first I pre-process the terms or policies and try to remove as much legalese and as many complex words as possible.
Next, I would like to use a pre-trained model for the actual summarization, giving it the simplified text as input.
I want to use either the second or the third most downloaded transformer (sshleifer/distilbart-cnn-12-6 or google/pegasus-cnn_dailymail), whichever is easier for a beginner to use or for you to explain.
I already tried out the default pipeline,
summarizer = pipeline('summarization'), and got back a summary for a paragraph of Instagram’s T&C.
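For reference, a minimal sketch of pinning one of the checkpoints named above instead of relying on the pipeline default. The helper name summarize and the max_length/min_length values are illustrative assumptions, not settings from this thread. The model is only downloaded when no summarizer is passed in.

```python
from typing import Callable, Optional

# One of the two checkpoints named in the question.
MODEL_NAME = "sshleifer/distilbart-cnn-12-6"


def summarize(text: str, summarizer: Optional[Callable] = None,
              max_length: int = 130, min_length: int = 30) -> str:
    """Summarize `text` with a Hugging Face summarization pipeline."""
    if summarizer is None:
        # Imported lazily so the helper can be exercised without transformers.
        from transformers import pipeline
        summarizer = pipeline("summarization", model=MODEL_NAME)
    # truncation=True cuts over-long inputs down to the model's limit
    # instead of failing; the length values are illustrative assumptions.
    result = summarizer(text, max_length=max_length, min_length=min_length,
                        truncation=True)
    return result[0]["summary_text"]
```

Swapping MODEL_NAME for "google/pegasus-cnn_dailymail" should be the only change needed to try the other checkpoint, though Pegasus is a larger model and needs more memory.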
I tried using the Pegasus model following this tutorial and got “RuntimeError: CUDA out of memory” where I ran out of memory on my GPU.
Thank you for your valuable time and help
Do you have any concrete questions though? Where exactly are you stuck?
Regarding the out of memory error - Have you tried decreasing the batch size or using a smaller model?
I wouldn’t say any transformer is “easier” or harder. That’s what’s beautiful about huggingface, it gives you access to many models through one API. Different kinds of models may have different needs but I wouldn’t say there are easier and harder models, as a lot of the complexity is abstracted away by huggingface.
One more thing I think I’d try is not to remove the legalese. Usually those are the important parts. Wouldn’t it be awesome if your model included readable summaries of that stuff?
If you have examples of T&Cs and summaries, then you could fine tune any model designed for that task, or you could use an EncoderDecoderModel as explained here: Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models.
If you don’t have any training data I’d still leave the legalese in and just see what the result looks like. It might still be okay.
T&Cs are usually long, though. Are you currently just truncating the input (most models I’ve come across have a max input length of 512 tokens)? This is something I’m trying to solve myself right now.
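One common workaround for long inputs, sketched here under assumptions: split the document into chunks, summarize each chunk, and join the partial summaries. The function names and the 400-word chunk size are hypothetical, and word count is only a rough stand-in for the model’s token limit.

```python
from typing import Callable, List


def chunk_words(text: str, max_words: int = 400) -> List[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def summarize_long(text: str, summarize_fn: Callable[[str], str],
                   max_words: int = 400) -> str:
    """Summarize each chunk separately and join the partial summaries."""
    return " ".join(summarize_fn(chunk)
                    for chunk in chunk_words(text, max_words))
```

Here summarize_fn is any callable that maps one string to its summary, for example a thin wrapper around a summarization pipeline. Splitting on sentence or section boundaries would likely give better summaries than a fixed word count.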
Disclaimer: I’m fairly new to this myself.
Hi, thank you for the reply and advice.
I forgot to mention that I want the summary to be as simple as possible, so even the average Joe would understand it. That’s why I’m trying to clean up the legalese before feeding it to the summarizer.
So for the memory issue - I tried it via Google Colab with a GPU and tried to use the Pegasus model.
Upon reaching this line - tokenizer = AutoTokenizer.from_pretrained("google/pegasus-cnn_dailymail", use_fast=False)
I got an error stating:
"ValueError: Couldn’t instantiate the backend tokenizer from one of: (1) a tokenizers library serialization file, (2) a slow tokenizer instance to convert or (3) an equivalent slow tokenizer class to instantiate and convert. You need to have sentencepiece installed to convert a slow tokenizer to a fast one."
Then I found the docs and set use_fast=False, but it didn’t work.
I also updated pip to the latest version (pip-21.0.1) - still the same error.
I also installed sentencepiece (Successfully installed sentencepiece-0.1.91) - the same error persisted.
Not a solution, but note that after the installation of sentencepiece (0.1.95) the error message changes to:
PegasusConverter requires the protobuf library but it was not found in your environment. Checkout the instructions on the
installation page of its repo: https://github.com/protocolbuffers/protobuf/tree/master/python#installation and follow the ones
that match your environment.
So if you want to use that specific tokenizer, you should probably install protobuf.
This is, however, not related to your GPU memory issues. As @neuralpat pointed out, the memory issue should be tackled by decreasing the batch size.
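A rough sketch of what decreasing the batch size can look like in practice: feed the texts to the model a few at a time rather than all at once. The names batched and summarize_in_batches are hypothetical, and a batch size of 1 is just the most conservative starting point.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def summarize_in_batches(texts: Sequence[str],
                         summarize_batch: Callable[[List[str]], List[str]],
                         batch_size: int = 1) -> List[str]:
    """Run `summarize_batch` on small batches to limit peak memory use."""
    summaries: List[str] = []
    for batch in batched(texts, batch_size):
        summaries.extend(summarize_batch(batch))
    return summaries
```

With a transformers summarization pipeline, summarize_batch could be something like lambda b: [r["summary_text"] for r in summarizer(b, truncation=True)]. If a batch size of 1 still runs out of memory, a smaller model or shorter inputs are the remaining options.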
Thanks for the reply
Using Google Colab’s GPU, I got the Pegasus model to work by reinstalling Transformers with sentencepiece and setting use_fast=False.
I didn’t get that protobuf error, however. I was using PyCharm but I’m planning on creating a Google App Engine project and then bringing Pegasus over.
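When moving to a new environment, the tokenizer dependencies discussed in this thread can be checked before loading anything. This is a sketch under assumptions: the helper names are hypothetical, and the rule that protobuf is only needed for the fast (converted) tokenizer is inferred from the error messages quoted above.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the module names from `names` that cannot be found."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "google") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


def load_pegasus_tokenizer(use_fast: bool = False):
    """Load the Pegasus tokenizer after checking its dependencies."""
    required = ["sentencepiece"]
    if use_fast:
        # Assumption: converting to a fast tokenizer also needs protobuf.
        required.append("google.protobuf")
    missing = missing_modules(required)
    if missing:
        raise RuntimeError("Install these first: " + ", ".join(missing))
    from transformers import AutoTokenizer
    return AutoTokenizer.from_pretrained("google/pegasus-cnn_dailymail",
                                         use_fast=use_fast)
```

In a notebook, the runtime may need a restart after installing sentencepiece so that Transformers picks it up.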