[pegasus] evaluation datasets and build scripts are now available

Thanks to the work of @sshleifer, @valhalla and yours truly, we have now released 11 datasets used for evaluating pegasus. For the transformers API, see Pegasus.

The datasets are available directly from s3. All the build instructions/scripts are available as well.

You will find them in this repo.

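The full list of pegasus evaluation datasets (big_patent is the one we could not build yet, see the note after the list):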

  • aeslc
  • arxiv
  • big_patent
  • billsum
  • cnn_dailymail
  • gigaword
  • multi_news
  • newsroom
  • pubmed
  • reddit_tifu
  • wikihow
  • xsum
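Once you have downloaded and unpacked one of these datasets, loading it for evaluation could look roughly like the sketch below. It assumes each split is stored as a pair of line-aligned text files, `<split>.source` for the documents and `<split>.target` for the reference summaries. That layout is an assumption here, not something this post confirms, so check the build scripts in the repo. The `read_pairs` helper and the `xsum` directory name in the usage note are hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def read_lines(path: Path) -> List[str]:
    """Read a text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def read_pairs(data_dir: str, split: str = "test") -> List[Tuple[str, str]]:
    """Return (document, reference_summary) pairs for one split.

    Assumed layout (not confirmed by the post): `<split>.source` and
    `<split>.target`, one example per line, aligned by line number.
    """
    root = Path(data_dir)
    sources = read_lines(root / f"{split}.source")
    targets = read_lines(root / f"{split}.target")
    if len(sources) != len(targets):
        raise ValueError(
            f"{split}: {len(sources)} documents but {len(targets)} summaries"
        )
    return list(zip(sources, targets))
```

For example, `read_pairs("xsum", "test")` would return the test pairs from a local `xsum` directory, ready to be passed to whatever summarization and scoring code you use.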

We couldn’t figure out how to build the big_patent dataset, as we couldn’t build its arrow dataset (see https://github.com/google-research/pegasus/issues/114). If you can help build it, that would be amazing!