How to improve summarization?

This might be a tricky question (because summarization is difficult), but consider the example shown in the documentation:

summarizer("An apple a day, keeps the doctor away", min_length=5, max_length=20)
Your max_length is set to 20, but you input_length is only 13. You might consider decreasing max_length manually, e.g. summarizer('...', max_length=50)
Out[49]: [{'summary_text': 'apple a day keeps the doctor away from the doctor.'}]

I find the summarized text a bit… interesting :smile: Granted, this was a very short example. But let’s consider a more real-life example taken from the NYT:

mytext = '''
The highly contagious Delta variant is now responsible for almost all new Covid-19 cases in the United States, and cases are rising rapidly. For the first time since February, there were more than 100,000 confirmed cases on Tuesday, the same day the Centers for Disease Control and Prevention recommended that vaccinated people should resume wearing masks in public indoor spaces in communities where the virus is surging.

That updated guidance was based in part on a new internal report that cited evidence that vaccinated people experiencing breakthrough infections of the Delta variant, which remain infrequent, may be as capable of spreading the virus as infected unvaccinated people.

Several studies, including ones referenced in the C.D.C.’s presentation, have shown that vaccines remain effective against the Delta variant, particularly against hospitalization and death. That has held true in the real world: About 97 percent of those recently hospitalized by the virus were unvaccinated, the C.D.C. said. But in counties where vaccination rates are low, cases are rising fast, and deaths are also on the rise.
'''

summarizer(mytext, min_length=5, max_length=20)
Out[51]: [{'summary_text': 'the highly contagious variant is now responsible for almost all new cases in the united states '}]

But… this is just the first sentence of the whole paragraph almost verbatim! What do you think we can do to improve the output of the summarization pipeline? Can summarization be trained?


A funny result from NLP work on summarization is that the first sentence of a news article usually turns out to be a pretty challenging baseline to beat :grinning_face_with_smiling_eyes: This isn’t necessarily a bad thing—it’s great for us as human readers that news writers do this!
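To make the "first sentence" baseline concrete, here is a minimal sketch in plain Python. The helper names `lead_sentence` and `unigram_recall` are made up for this illustration, the sentence splitter is deliberately naive (it would trip over abbreviations like "C.D.C."), and the overlap score is only a rough stand-in for a proper metric such as ROUGE. The article text is a shortened excerpt of the example above.

```python
import re

def lead_sentence(text):
    """Return the first sentence, splitting naively after '.', '!' or '?'."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return sentences[0]

def unigram_recall(candidate, reference):
    """Fraction of distinct words in `reference` that also occur in `candidate`."""
    cand = set(re.findall(r"[a-z0-9']+", candidate.lower()))
    ref = set(re.findall(r"[a-z0-9']+", reference.lower()))
    return len(cand & ref) / len(ref) if ref else 0.0

# Shortened excerpt of the article quoted in the question.
article = (
    "The highly contagious Delta variant is now responsible for almost all new "
    "Covid-19 cases in the United States, and cases are rising rapidly. "
    "For the first time since February, there were more than 100,000 confirmed "
    "cases on Tuesday."
)
# Output of the summarization pipeline, copied from the question.
model_summary = (
    "the highly contagious variant is now responsible for almost all new "
    "cases in the united states"
)

lead = lead_sentence(article)
overlap = unigram_recall(lead, model_summary)
print(lead)
print(overlap)  # 1.0: every word of the model summary already appears in the lead
```

Comparing a model's output against this baseline is a quick sanity check: if the overlap is close to 1.0, the model is adding little beyond copying the opening sentence.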

Summarization is an active area of research, with both supervised and unsupervised training approaches. If you check out NAACL 2021, one of the most recent NLP conferences (the very latest, ACL 2021, starts today), there are twenty-seven papers about summarization!

Haha, interesting indeed @mbforbes! I will look at the paper. But do you know how you would go about fine-tuning a summarization model with Hugging Face? Would that work the same as for text classification?