Has anyone managed to use the code in examples/seq2seq on the WikiHow dataset used in the PEGASUS paper?
If you have a working data dir (one line per example for source and target) but no preprocessing code, a Google Drive/S3 link to the data would be useful!
If you have preprocessing code as well, even more useful! Thanks!
Here you go Sam @sshleifer
Raw article files:
gdown -O train_articles.zip --id 1-1CR6jh6StaI69AsbBXD8lQskFbGc2Ez # train
gdown -O valid_articles_.zip --id 1-EGoT5ZKRNHQb_ewNpD9GZCvQ3uHzDSi # val
gdown -O test_articles_.zip --id 1-CxzdzEIuBYzCs06zrglYrLBlLI6kjSZ # test
Unzip these three archives.
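If you would rather do it from Python, here is a quick sketch using the standard zipfile module (it assumes each archive unpacks into the train_articles/, valid_articles/ and test_articles/ directories that the script below expects):

import zipfile

# Extract the three archives downloaded above into the current directory.
# Assumption: the zips contain train_articles/, valid_articles/ and test_articles/ at the top level.
for archive in ["train_articles.zip", "valid_articles_.zip", "test_articles_.zip"]:
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(".")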
Pre-processing code (not super readable, I wrote it a while ago):
import os
import glob
import json
import re


def get_art_abs_wikihow(path):
    articles = glob.glob('%s/*' % path)
    for a in articles:
        try:
            with open(a, 'r') as f:
                text = f.read()
            # the text before '@article' is the summary (tagged '@summary'),
            # the text after it is the article body
            splits = text.split('@article')
            abstract = splits[0].replace('\n', '').replace('@summary', '').strip()
            article = splits[1].replace('\n', '').replace('@article', '').strip()
            yield article, abstract
        except Exception as e:
            yield None


def write_to_bin(lines, out_prefix):
    print("Making bin file for %s..." % out_prefix)
    with open(out_prefix + '.source', 'at') as source_file, open(out_prefix + '.target', 'at') as target_file:
        for idx, line in enumerate(lines):
            if idx % 1000 == 0:
                print("Writing story %i" % idx)
            # Get the strings to write to the .source/.target files
            if line is None:
                continue
            article, abstract = line
            # a threshold is used to remove short articles with long summaries
            # as well as articles with no summary
            if len(abstract) < (0.75 * len(article)):
                # remove extra commas in abstracts
                abstract = abstract.replace(".,", ".")
                # remove extra commas in articles
                article = article.replace(";,", "")
                article = article.replace(".,", ".")
                article = re.sub(r'[.]+[\n]+[,]', ".\n", article)
                abstract = abstract.strip().replace("\n", "")
                article = article.strip().replace("\n", "")
                # Write article and abstract to files
                source_file.write(article + '\n')
                target_file.write(abstract + '\n')
    print("Finished writing files")


def create_stories(save_path='wikihow'):
    # Create some new directories
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    # write wikihow
    print("Making bin file for wikihow valid set")
    lines = get_art_abs_wikihow('./valid_articles')
    write_to_bin(lines, os.path.join(save_path, "val"))
    print("Making bin file for wikihow train set")
    lines = get_art_abs_wikihow('./train_articles')
    write_to_bin(lines, os.path.join(save_path, "train"))
    print("Making bin file for wikihow test set")
    lines = get_art_abs_wikihow('./test_articles')
    write_to_bin(lines, os.path.join(save_path, "test"))
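To run it end to end, a minimal driver along these lines should work (assuming the unzipped train_articles/, valid_articles/ and test_articles/ directories sit in the current working directory, as hard-coded in create_stories):

if __name__ == '__main__':
    # Writes wikihow/{train,val,test}.source and .target, one example per line.
    create_stories(save_path='wikihow')

Note that write_to_bin opens the output files in append mode ('at'), so delete the wikihow/ directory before re-running, otherwise examples get duplicated.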
And here is the processed data generated with the script above (one line per example):
gdown -O wikihow.zip --id 1_QE1PLJhhugMf2e1edUGJRMiktKsm8YU
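To sanity-check either the download or your own run, a small check like the following confirms the one-line-per-example pairing of each split (assuming the archive unpacks to the same wikihow/{train,val,test}.source / .target layout the script produces):

import os

data_dir = 'wikihow'  # directory produced by create_stories (assumed layout of wikihow.zip)
for split in ['train', 'val', 'test']:
    with open(os.path.join(data_dir, split + '.source')) as src, \
         open(os.path.join(data_dir, split + '.target')) as tgt:
        n_src = sum(1 for _ in src)
        n_tgt = sum(1 for _ in tgt)
    assert n_src == n_tgt, '%s: %d sources vs %d targets' % (split, n_src, n_tgt)
    print('%s: %d article/summary pairs' % (split, n_src))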
@valhalla do you have any of the other datasets in this table that we haven't covered yet? Trying to avoid duplicating work.
@sshleifer I have arxiv
gdown --id 1K2kDBTNXS2ikx9xKmi2Fy0Wsc5u_Lls0 --output arxiv.zip
It's not pre-processed yet; I will process it and post a link here. It's in jsonlines format: each line is a JSON object with keys abstract_text and article_text.
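Once downloaded, a rough conversion sketch to the same one-line-per-example format could look like this (the file name arxiv.jsonl and the handling of list-valued fields are assumptions on my part; adjust to whatever the archive actually contains):

import json

def flatten(field):
    # Some dumps store each field as a list of sentences, others as a single string.
    return ' '.join(field) if isinstance(field, list) else field

# 'arxiv.jsonl' is an assumed name for the jsonlines file inside arxiv.zip.
with open('arxiv.jsonl') as f, open('arxiv.source', 'w') as src, open('arxiv.target', 'w') as tgt:
    for line in f:
        example = json.loads(line)
        article = flatten(example['article_text']).replace('\n', ' ').strip()
        abstract = flatten(example['abstract_text']).replace('\n', ' ').strip()
        src.write(article + '\n')
        tgt.write(abstract + '\n')

Depending on how the dump was produced, you may also need to strip sentence markers from the abstracts.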
Pinging @anthonyfuller: Anthony is a summarization master, he might have other datasets pre-processed.
@valhalla Haha, I wish, brother 
Unfortunately I recently deleted my preprocessed sum datasets. Sorry I can’t help.