GPT2 summarization performance

Has anyone run benchmark studies to evaluate the generation/summarization performance of GPT2 on datasets such as “xsum”? If so, could you share the performance numbers (in terms of ROUGE scores) you got? I searched for these results online but couldn’t find any.


I can suggest starting to look here:

I haven’t found any either, to be honest. But!

I believe it is smarter to use encoder-decoder models (like PEGASUS or BART) for summarization. Decoder-only language models like GPT were trained only to continue text (unlike, roughly, BART), and they don’t explicitly extract the idea of the given text. In an encoder-decoder model, by contrast, the encoder can be interpreted as an “idea extractor” and the decoder as the generator of natural language text.
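To make the architectural difference a bit more concrete, here is a small sketch of which positions each token is allowed to look at in the two setups. It is a simplified picture under my own assumptions: the function names are made up for illustration, source and target are laid out in one index space, and it ignores that a real encoder-decoder model uses separate stacks with cross-attention.

```python
def decoder_only_visibility(src_len, tgt_len):
    """Causal masking over the whole sequence (GPT-style).

    Row i lists which positions token i may attend to: itself and
    everything before it, including inside the source article.
    """
    n = src_len + tgt_len
    return [[j <= i for j in range(n)] for i in range(n)]


def encoder_decoder_visibility(src_len, tgt_len):
    """Simplified view of an encoder-decoder model (BART/PEGASUS-style).

    Source tokens see the entire source (bidirectional encoder).
    Target tokens see the entire source plus earlier target tokens.
    """
    n = src_len + tgt_len
    rows = []
    for i in range(n):
        if i < src_len:
            # Encoder side: full view of the source, no view of the target.
            rows.append([j < src_len for j in range(n)])
        else:
            # Decoder side: all source positions plus a causal view of the target.
            rows.append([j <= i for j in range(n)])
    return rows
```

With a 3-token source and a 2-token target, the first source token sees only itself in the decoder-only case but the whole source in the encoder-decoder case, while the rows for the target tokens are identical in both.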

P.S. I know of one paper that tries to show that summarizing with GPT3 works better than with BART. Even though its experiments rely heavily on the Russian language, you can use the references in the paper to dig deeper and find what you are looking for. Paper arxiv link:

Good luck, and let me know if you find anything.

Thank you, Kirill, for sharing the pointers. I agree with you that BART and PEGASUS are better suited to text summarization than decoder-only models. However, I was curious whether someone had experimented with GPT2 variants for text generation. I found some sample implementations online, but no performance metrics on standard datasets. I also feel it is not straightforward to run inference (e.g., generating summaries) with GPT2. Caveats such as penalizing long summaries, using special tokens to train and run inference with a decoder-only model, and not-immediately-obvious decoding strategies make inference tricky IMO. I ran a few tests and found the performance to be well below par, contrary to claims in some papers that GPT2-style models readily improve performance on supervised tasks, especially text generation. Nevertheless, I will keep looking and update this thread if I find any relevant articles or a robust way of generating summaries with GPT2.
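To illustrate two of these caveats, here is a minimal sketch of one common way to set this up, not a recipe from any particular paper. It packs the article, a separator token, and the summary into one sequence and masks the labels so the loss covers only the summary, then scores candidates with a length-normalized log-probability. The names `build_example` and `length_normalized_score`, the token ids, and the truncation policy are my own assumptions; `-100` is the label value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def build_example(article_ids, summary_ids, sep_id, eos_id, max_len):
    """Pack one (article, summary) pair into a single decoder-only sequence.

    Layout: article tokens, separator, summary tokens, end-of-sequence.
    Labels are masked over the article and separator, so only the summary
    and the end-of-sequence token contribute to the loss.
    """
    target = list(summary_ids) + [eos_id]
    # Keep the whole target; truncate the article to whatever room is left
    # after reserving one slot for the separator.
    budget = max_len - len(target) - 1
    if budget < 0:
        raise ValueError("max_len is too small to fit the summary")
    source = list(article_ids)[:budget] + [sep_id]
    input_ids = source + target
    labels = [IGNORE_INDEX] * len(source) + target
    return input_ids, labels


def length_normalized_score(token_logprobs, alpha=1.0):
    """Summed log-probability divided by length ** alpha.

    Log-probabilities are negative, so a larger alpha favours longer
    candidates and a smaller (or negative) alpha favours shorter ones.
    """
    if not token_logprobs:
        raise ValueError("need at least one token")
    return sum(token_logprobs) / (len(token_logprobs) ** alpha)
```

At inference time the model would be given only the article followed by the separator and asked to generate until it emits the end-of-sequence token; the separator and end-of-sequence ids have to match whatever was used during fine-tuning.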

Any updates on that?

Do we need to conclude that decoder-only models are not suited for text summarization?

And how does this apply to GPT-3? It seems to summarize quite well. Should we expect a 175B-parameter seq2seq model to perform considerably better?