Uploading large datasets iteratively

How to load, process and upload 300GB of text

You can actually do it on all your text files at once. The datasets library works with datasets bigger than memory, since it memory-maps the dataset files from your disk. The steps are below; a combined sketch follows the list.

  1. load the dataset using load_dataset, e.g.
ds = load_dataset("text", data_files=list_of_text_files, split="train")
  2. use map to clean your dataset (use num_proc to make it faster), e.g.
ds = ds.map(clean_document, num_proc=num_proc)
  3. upload with push_to_hub, e.g.
ds.push_to_hub("316usman/my_text_dataset")
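Putting the three steps together, here is a minimal sketch. The file location, the clean_document function and num_proc=8 are placeholder assumptions, not part of the original answer; adapt them to your data.

```python
from glob import glob

from datasets import load_dataset

# hypothetical location of the raw text files
list_of_text_files = sorted(glob("my_corpus/*.txt"))

def clean_document(example):
    # placeholder cleaning step: strip surrounding whitespace from each line
    example["text"] = example["text"].strip()
    return example

# 1. load: the files are memory-mapped from disk, so this also works for corpora larger than RAM
ds = load_dataset("text", data_files=list_of_text_files, split="train")

# 2. clean in parallel
ds = ds.map(clean_document, num_proc=8)

# 3. upload to the Hub (requires being logged in, e.g. via `huggingface-cli login`)
ds.push_to_hub("316usman/my_text_dataset")
```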

An alternative if you want to upload iteratively anyway

There’s no “append” mode yet in push_to_hub, but you can execute the above steps once for each file as a separate split:

ds.push_to_hub("316usman/my_text_dataset", split=f"part_{i:05d}")
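A minimal sketch of that loop, assuming one split per text file and reusing the hypothetical file location and clean_document function from the sketch above:

```python
from glob import glob

from datasets import load_dataset

text_files = sorted(glob("my_corpus/*.txt"))  # hypothetical location

for i, path in enumerate(text_files):
    # load and clean one file at a time
    ds = load_dataset("text", data_files=path, split="train")
    ds = ds.map(clean_document, num_proc=8)  # clean_document as sketched above
    # push each file as its own split: part_00000, part_00001, ...
    ds.push_to_hub("316usman/my_text_dataset", split=f"part_{i:05d}")
```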

and then regroup all the splits together by modifying the YAML at the top of the README.md in the dataset repository on HF:

from

configs:
- config_name: default
  data_files:
  - split: part_00000
    path: data/part_00000-*
  - split: part_00001
    path: data/part_00001-*
  ...

to

configs:
- config_name: default
  data_files:
  - split: train
    path: data/part_*
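
After that edit, every part is served as one train split. A quick check, assuming the same repository name as above:

```python
from datasets import load_dataset

# all data/part_* files are now loaded as a single "train" split
ds = load_dataset("316usman/my_text_dataset", split="train")
print(ds)
```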