Uploading large datasets iteratively

How to load, process and upload 300GB of text

You can actually do it on all your text files at once. The datasets library works with datasets bigger than memory, since it memory-maps the dataset files from your disk. The steps are below; a combined sketch follows the list.

  1. load the dataset using load_dataset, e.g.
ds = load_dataset("text", data_files=list_of_text_files, split="train")
  2. use map to clean your dataset (use num_proc to make it faster), e.g.
ds = ds.map(clean_document, num_proc=num_proc)
  3. upload with push_to_hub, e.g.
ds.push_to_hub("316usman/my_text_dataset")
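Putting the three steps together, here is a minimal sketch. The file location, the clean_document function and num_proc=8 are placeholder assumptions, not part of the original answer; adapt them to your data.

```python
from glob import glob

from datasets import load_dataset

# hypothetical location of the raw text files
list_of_text_files = sorted(glob("my_corpus/*.txt"))

def clean_document(example):
    # placeholder cleaning step: strip surrounding whitespace from each line
    example["text"] = example["text"].strip()
    return example

# 1. load: the files are memory-mapped from disk, so this also works for corpora larger than RAM
ds = load_dataset("text", data_files=list_of_text_files, split="train")

# 2. clean in parallel
ds = ds.map(clean_document, num_proc=8)

# 3. upload to the Hub (requires being logged in, e.g. via `huggingface-cli login`)
ds.push_to_hub("316usman/my_text_dataset")
```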

An alternative if you want to upload iteratively anyway

There’s no “append” mode yet in push_to_hub, but you can execute the above steps once for each file as a separate split:

ds.push_to_hub("316usman/my_text_dataset", split=f"part_{i:05d}")
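A minimal sketch of that loop, assuming one split per text file and reusing the hypothetical file location and clean_document function from the sketch above:

```python
from glob import glob

from datasets import load_dataset

text_files = sorted(glob("my_corpus/*.txt"))  # hypothetical location

for i, path in enumerate(text_files):
    # load and clean one file at a time
    ds = load_dataset("text", data_files=path, split="train")
    ds = ds.map(clean_document, num_proc=8)  # clean_document as sketched above
    # push each file as its own split: part_00000, part_00001, ...
    ds.push_to_hub("316usman/my_text_dataset", split=f"part_{i:05d}")
```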

and then regroup all the splits together by modifying the YAML at the top of the README.md in the dataset repository on HF:

from

configs:
- config_name: default
  data_files:
  - split: part_00000
    path: data/part_00000-*
  - split: part_00001
    path: data/part_00001-*
  ...

to

configs:
- config_name: default
  data_files:
  - split: train
    path: data/part_*
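
After that edit, every part is served as one train split. A quick check, assuming the same repository name as above:

```python
from datasets import load_dataset

# all data/part_* files are now loaded as a single "train" split
ds = load_dataset("316usman/my_text_dataset", split="train")
print(ds)
```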