Thanks to the work of @sshleifer, @valhalla and yours truly, we have now released 11 datasets used for evaluating pegasus. For the transformers API, see Pegasus.
The datasets are available directly from S3. All the build instructions and scripts are available as well; you will find them in this repo.
The list includes:
- aeslc
- arxiv
- big_patent
- billsum
- cnn_dailymail
- gigaword
- multi_news
- newsroom
- pubmed
- reddit_tifu
- wikihow
- xsum
We couldn’t figure out how to build the big_patent dataset, because building its arrow dataset fails for us (see https://github.com/google-research/pegasus/issues/114). If you can help to build it, it’d be amazing!
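If you want to script over the datasets, here is a minimal Python sketch that separates the released names from the one we could not build. The names are copied from the list above. `PEGASUS_EVAL_DATASETS`, `NOT_BUILT` and `released_datasets` are hypothetical helper names, and the sketch assumes big_patent is the only entry that is missing. It does not download anything.

```python
# Dataset names copied verbatim from the list in this post.
PEGASUS_EVAL_DATASETS = [
    "aeslc",
    "arxiv",
    "big_patent",
    "billsum",
    "cnn_dailymail",
    "gigaword",
    "multi_news",
    "newsroom",
    "pubmed",
    "reddit_tifu",
    "wikihow",
    "xsum",
]

# Assumption: big_patent is the only dataset from the list that was not built.
NOT_BUILT = {"big_patent"}


def released_datasets():
    """Return the names from the list that are assumed to be released."""
    return [name for name in PEGASUS_EVAL_DATASETS if name not in NOT_BUILT]


if __name__ == "__main__":
    # Print one released dataset name per line.
    for name in released_datasets():
        print(name)
```

Keeping the unbuilt name in a separate set means it only has to be removed from `NOT_BUILT` once someone manages to build it.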