GPT2 summarization performance

Has anyone run benchmark studies to evaluate the generation/summarization performance of GPT2 on datasets such as “xsum”? If so, could you share the performance numbers (in terms of ROUGE scores) you got? I searched for these results online, but couldn’t find any.
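
To make clear which kind of number I am asking about, here is a minimal sketch of ROUGE-1 F1 (clipped unigram overlap between a generated summary and a reference). It is a simplification: the function name `rouge1_f1` is just illustrative, tokenization is plain lowercased whitespace splitting, and there is no stemming, so the values will not match published scores exactly. Reported results normally come from a standard ROUGE package.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap, whitespace tokens, no stemming."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token (the "clipping").
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Averaging this over a test split would give a rough corpus-level figure, but only scores computed with the same tokenization and stemming settings are comparable to each other.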


I believe it is smarter to use encoder-decoder models (like PEGASUS or BART) for summarization. Decoder-only language models like GPT are trained only to continue text (unlike BART), and they have no separate component that extracts the idea of the given text. In an encoder-decoder model, the encoder can be interpreted as an “idea extractor” and the decoder as the generator of natural-language text.

I know of one paper that tries to show that summarizing with GPT3 is better than with BART. Although its experiments rely heavily on Russian-language data, you can follow the references in the paper to dig deeper and find what you are looking for.

Thank you, Kirill, for sharing the pointers. I agree with you that BART and PEGASUS are better for text summarization than decoder-only models. However, I was curious whether someone had experimented with GPT2 variants for text generation. I found some sample implementations online, but no performance metrics on standard datasets. I also feel it is not straightforward to run inference (e.g., generate summaries) with GPT2. Caveats such as penalizing long summaries, using special tokens to adapt a decoder-only model for training and inference, and not-immediately-obvious decoding strategies make inference tricky, IMO. I ran a few tests and found the performance to be well below par, contrary to claims in some papers that GPT2-style models readily improve performance on supervised tasks, especially text generation. Nevertheless, I will keep looking and will update this thread if I find any relevant articles or a robust way of doing summary generation with GPT2.
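
For anyone trying the same thing, here is a rough sketch of the special-token caveat: how an article and its summary can be laid out as a single sequence for a decoder-only model, and how the summary can be cut back out of the generated text. The `TL;DR:` cue is the prompt used for zero-shot summarization in the GPT-2 paper and `<|endoftext|>` is GPT-2's end-of-text token. Everything else (the helper names, the 700-word budget, truncating by whitespace words rather than tokenizer tokens) is an assumption for illustration, not a recommended recipe.

```python
EOS = "<|endoftext|>"  # GPT-2's end-of-text token
CUE = "\nTL;DR:"       # cue separating the article from its summary

def truncate_words(text, max_words):
    """Crude whitespace truncation; a real pipeline would count tokenizer tokens."""
    return " ".join(text.split()[:max_words])

def build_training_text(article, summary, max_article_words=700):
    """Article, cue, summary, EOS as one sequence for causal-LM fine-tuning."""
    article = truncate_words(article, max_article_words)
    return article + CUE + " " + summary.strip() + EOS

def build_prompt(article, max_article_words=700):
    """Same layout as training, but stopping right after the cue."""
    return truncate_words(article, max_article_words) + CUE

def extract_summary(generated_text, prompt):
    """Drop the echoed prompt, then cut at the first EOS if one was emitted."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.split(EOS, 1)[0].strip()
```

At inference time the prompt from `build_prompt` would go to the model, and the decoded output through `extract_summary`. The decoding settings (beam search or sampling, length penalty, maximum new tokens) are a separate choice that this sketch does not cover, and they were the part I found least obvious.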