Uploading large datasets iteratively

Hi, I am trying to upload about 300GB of text. First I would like to do some processing on the text and then upload it to the dataset. For example, I open a text file, clean it, upload it, and then do the same with the next file. How do I do this iteratively?

To use .push_to_hub it seems I would need to have added every new item to the dataset first, which would require a lot of RAM.

How to load, process and upload 300GB of text

You can actually do it on all your text files at once.
Indeed, the datasets library can work with datasets bigger than memory, since it memory-maps the dataset files from your disk.

  1. load the dataset using load_dataset, e.g.
ds = load_dataset("text", data_files=list_of_text_files, split="train")
  2. use map to clean your dataset (use num_proc to make it faster), e.g.
ds = ds.map(clean_document, num_proc=num_proc)
  3. upload with push_to_hub, e.g.
ds.push_to_hub("316usman/my_text_dataset")

An alternative if you want to upload iteratively anyway

There's no "append" mode yet in push_to_hub, but you can execute the above steps once for each file as a separate split:

ds.push_to_hub("316usman/my_text_dataset", split=f"part_{i:05d}")

and then regroup all the splits together by modifying the YAML at the top of the README.md in the dataset repository on HF:

from

configs:
- config_name: default
  data_files:
  - split: part_00000
    path: data/part_00000-*
  - split: part_00001
    path: data/part_00001-*
  ...

to

configs:
- config_name: default
  data_files:
  - split: train
    path: data/part_*
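After editing the YAML, loading the repo should give you everything back as a single split, e.g.:

from datasets import load_dataset

# all the data/part_* Parquet files are now exposed as one "train" split
ds = load_dataset("316usman/my_text_dataset", split="train")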

@lhoestq Thank you so much for replying

What I get from your first approach is that the datasets library does not load the files into RAM; rather, they are stored on disk (HDD/SSD). Please correct me if I am wrong.

I also came across the from_generator() method. Now my approach is to download a file from block storage and process it in the generator function. In this case, where would the data be stored until I have called push_to_hub()? On disk, perhaps?

The alternative approach looks promising; I will try that out too. In the alternative approach, would I be uploading text files (with each row in my file representing an entry in the dataset)?

Again thanks so much for replying. Hope to connect with you.

Both load_dataset and from_generator write the dataset to disk and then memory-map the data so as not to fill up the RAM (you can think of it as using your disk as virtual memory).
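As a rough sketch of the from_generator route (the file iteration and cleaning below are hypothetical placeholders for your block-storage download and processing logic):

from datasets import Dataset

def gen():
    for path in ["part1.txt", "part2.txt"]:  # placeholder: download each file from block storage first
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield {"text": line.strip()}  # one cleaned row per dataset entry

# the generated examples are written to Arrow files in the local cache and memory-mapped
ds = Dataset.from_generator(gen)
ds.push_to_hub("316usman/my_text_dataset")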

The alternative approach looks promising; I will try that out too. In the alternative approach, would I be uploading text files (with each row in my file representing an entry in the dataset)?

It uploads Parquet files. Parquet is a compressed columnar format that is particularly suited to storing all sorts of datasets and is easy to load in Python 🙂
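For example, once pushed, the Parquet files can be loaded back, or even streamed row by row, with load_dataset:

from datasets import load_dataset

# streaming=True reads the Parquet shards lazily instead of downloading everything first
ds = load_dataset("316usman/my_text_dataset", split="train", streaming=True)
print(next(iter(ds)))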

Thank you so much, I'll try it out and get back to you.