WikiHow dataset preprocessing?

Has anyone managed to use the code in examples/seq2seq on the WikiHow dataset used in the PEGASUS paper?

If you have a working data dir (one line per example for source and target) but no preprocessing code, a Google Drive/S3 link to the data would be useful. Thanks!
If you have preprocessing code, even better!

Here you go, Sam @sshleifer

Raw article files:

gdown -O train_articles.zip --id 1-1CR6jh6StaI69AsbBXD8lQskFbGc2Ez # train
gdown -O valid_articles_.zip --id 1-EGoT5ZKRNHQb_ewNpD9GZCvQ3uHzDSi # val
gdown -O test_articles_.zip --id 1-CxzdzEIuBYzCs06zrglYrLBlLI6kjSZ # test 

Unzip these.
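
If you'd rather unzip from Python, here is a minimal sketch (assuming the archives extract to train_articles/, valid_articles/ and test_articles/ in the current directory, which is what the script below expects):

import zipfile

# Minimal sketch: extract each archive into the current directory.
# Assumes the zips unpack to train_articles/, valid_articles/ and test_articles/.
for archive in ["train_articles.zip", "valid_articles_.zip", "test_articles_.zip"]:
  with zipfile.ZipFile(archive) as zf:
    zf.extractall(".")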

Pre-processing code (not super readable, I wrote it a while ago):

import os
import glob
import json
import re

def get_art_abs_wikihow(path):
  """Yield (article, abstract) pairs from the raw WikiHow article files."""
  articles = glob.glob('%s/*' % path)
  for a in articles:
    try:
      with open(a, 'r') as f:
        text = f.read()
      # Each file contains the summary first, then the article, separated by '@article'.
      splits = text.split('@article')
      abstract = splits[0].replace('\n', '').replace('@summary', '').strip()
      article = splits[1].replace('\n', '').replace('@article', '').strip()
      yield article, abstract
    except Exception:
      # Skip malformed files; the caller filters out None entries.
      yield None

def write_to_bin(lines, out_prefix):
  """Write (article, abstract) pairs to <out_prefix>.source and <out_prefix>.target, one example per line."""
  print("Writing %s.source and %s.target..." % (out_prefix, out_prefix))

  # Write mode ('wt') so that re-running the script does not append duplicate examples.
  with open(out_prefix + '.source', 'wt') as source_file, open(out_prefix + '.target', 'wt') as target_file:
    for idx, line in enumerate(lines):
      if idx % 1000 == 0:
        print("Writing story %i" % idx)

      # Skip files that failed to parse
      if line is None:
        continue
      article, abstract = line

      # A length threshold removes short articles with long summaries, as well as articles with no summary
      if len(abstract) < (0.75 * len(article)):
        # remove extra commas in abstracts
        abstract = abstract.replace(".,", ".")
        # remove stray ";," and ".," artifacts in articles
        article = article.replace(";,", "")
        article = article.replace(".,", ".")
        article = re.sub(r'[.]+[\n]+[,]', ".\n", article)

        abstract = abstract.strip().replace("\n", "")
        article = article.strip().replace("\n", "")

        # Write the article and abstract as single lines
        source_file.write(article + '\n')
        target_file.write(abstract + '\n')

  print("Finished writing files")

def create_stories(save_path='wikihow'):
  # Create the output directory if needed
  if not os.path.exists(save_path):
    os.makedirs(save_path)

  print("Processing wikihow valid set")
  lines = get_art_abs_wikihow('./valid_articles')
  write_to_bin(lines, os.path.join(save_path, "val"))

  print("Processing wikihow train set")
  lines = get_art_abs_wikihow('./train_articles')
  write_to_bin(lines, os.path.join(save_path, "train"))

  print("Processing wikihow test set")
  lines = get_art_abs_wikihow('./test_articles')
  write_to_bin(lines, os.path.join(save_path, "test"))
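
To run it end to end (the folder names here match the unzipped archives above):

if __name__ == '__main__':
  # Expects ./train_articles, ./valid_articles and ./test_articles to exist;
  # writes train/val/test .source and .target files into ./wikihow.
  create_stories(save_path='wikihow')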

And the data processed with the above script (one line per example):

gdown -O wikihow.zip --id 1_QE1PLJhhugMf2e1edUGJRMiktKsm8YU
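
To sanity-check the download, you can count lines per split; this sketch assumes the zip unpacks into a wikihow/ directory with train/val/test .source and .target files:

import os

def count_lines(path):
  with open(path, encoding='utf-8') as f:
    return sum(1 for _ in f)

data_dir = 'wikihow'  # assumed extraction directory
for split in ['train', 'val', 'test']:
  src = os.path.join(data_dir, split + '.source')
  tgt = os.path.join(data_dir, split + '.target')
  if os.path.exists(src) and os.path.exists(tgt):
    # .source and .target should have the same number of lines
    print(split, count_lines(src), count_lines(tgt))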

Thanks! cc @stas

@valhalla do you have any of the other datasets in this table that we haven’t covered? Trying to avoid duplicating work :slight_smile:

@sshleifer I have arXiv:

gdown --id 1K2kDBTNXS2ikx9xKmi2Fy0Wsc5u_Lls0 --output arxiv.zip

It's not pre-processed yet; I'll process it and post a link here. It's in JSON Lines format: each line is a JSON object with the keys abstract_text and article_text.
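
In the meantime, a rough conversion sketch into the same one-line-per-example .source/.target layout as above (untested; the field types and any sentence markers like <S>/</S> are assumptions):

import json

def arxiv_to_lines(jsonl_path, out_prefix):
  # Rough sketch: writes <out_prefix>.source / <out_prefix>.target, one example per line.
  # Assumes each JSON line has 'article_text' and 'abstract_text', each either a
  # string or a list of sentence strings; strips <S>/</S> markers if present.
  def flatten(field):
    if isinstance(field, list):
      field = ' '.join(field)
    return field.replace('<S>', '').replace('</S>', '').replace('\n', ' ').strip()

  with open(jsonl_path, encoding='utf-8') as f, \
       open(out_prefix + '.source', 'w', encoding='utf-8') as src, \
       open(out_prefix + '.target', 'w', encoding='utf-8') as tgt:
    for line in f:
      ex = json.loads(line)
      src.write(flatten(ex['article_text']) + '\n')
      tgt.write(flatten(ex['abstract_text']) + '\n')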

Pinging @anthonyfuller. Anthony is a summarization master; he might have other datasets pre-processed.

@valhalla Haha, I wish, brother :slightly_smiling_face:

Unfortunately, I recently deleted my preprocessed summarization datasets. Sorry I can’t help.