How to load, process and upload 300GB of text
You can actually do it on all your text files at once. Indeed `datasets` can work with datasets bigger than memory, since it memory-maps the dataset files from your disk.
- load the dataset using `load_dataset`, e.g. `ds = load_dataset("text", data_files=list_of_text_files, split="train")`
- use `map` to clean your dataset (use `num_proc` to make it faster), e.g. `ds = ds.map(clean_document, num_proc=num_proc)`
- upload with `push_to_hub`, e.g. `ds.push_to_hub("316usman/my_text_dataset")` (see the full sketch right after this list)
An alternative: uploading iteratively
There’s no “append” mode yet in `push_to_hub`, but you can execute the above steps once per file and push each one as a separate split:
`ds.push_to_hub("316usman/my_text_dataset", split=f"part_{i:05d}")`
and then regroup all the splits together by modifying the YAML at the top of the README.md in the dataset repository on HF:
from
configs:
- config_name: default
  data_files:
  - split: part_00000
    path: data/part_00000-*
  - split: part_00001
    path: data/part_00001-*
  ...
to
configs:
- config_name: default
  data_files:
  - split: train
    path: data/part_*
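Once the YAML is updated, the regrouped dataset can be loaded (or streamed) as a single `train` split, for example:

```python
from datasets import load_dataset

# streaming avoids downloading the full 300GB if you only want to iterate over it
ds = load_dataset("316usman/my_text_dataset", split="train", streaming=True)
```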