Uploading large datasets iteratively

Hi, I am trying to upload about 300GB of text. First I would like to do some processing on the text and then upload it to the dataset. For example, I open a text file, clean it, upload it, and then do the same with the next file. How do I do this iteratively?

To use .push_to_hub it seems I would need to have added every new item to the dataset first, which would require a lot of RAM.

How to load, process and upload 300GB of text

You can actually do it on all your text files at once.
Indeed, the datasets library can work with datasets bigger than memory, since it memory-maps the dataset files from your disk.

  1. load the dataset using load_dataset, e.g.
ds = load_dataset("text", data_files=list_of_text_files, split="train")
  2. use map to clean your dataset (use num_proc to make it faster), e.g.
ds = ds.map(clean_document, num_proc=num_proc)
  3. upload with push_to_hub, e.g.
ds.push_to_hub("316usman/my_text_dataset")

An alternative if you want to upload iteratively anyway

There's no "append" mode yet in push_to_hub, but you can execute the above steps once for each file as a separate split:

ds.push_to_hub("316usman/my_text_dataset", split=f"part_{i:05d}")

and then regroup all the splits together by modifying the YAML at the top of the README.md in the dataset repository on HF:

from

configs:
- config_name: default
  data_files:
  - split: part_00000
    path: data/part_00000-*
  - split: part_00001
    path: data/part_00001-*
  ...

to

configs:
- config_name: default
  data_files:
  - split: train
    path: data/part_*
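After editing the YAML, loading the repo should give you everything back as a single split, e.g.:

from datasets import load_dataset

# all the data/part_* Parquet files are now exposed as one "train" split
ds = load_dataset("316usman/my_text_dataset", split="train")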

@lhoestq Thank you so much for replying

What I get from your first approach is that the datasets library does not load the files into RAM; rather, they are stored on disk (HDD/SSD). Please correct me if I am wrong.

I also came across the from_generator() method. Now my approach is to download a file from block storage and process it in the generator function. In this case, where would the data be stored until I have called push_to_hub()? On disk, perhaps?

The alternative approach looks promising; I will try that out too. In the alternative approach, would I be uploading text files (with each row in my file representing an entry in the dataset)?

Again thanks so much for replying. Hope to connect with you.

Both load_dataset and from_generator write the dataset to disk and then memory-map the data so as not to fill up the RAM (you can think of it as using your disk as virtual memory).
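As a rough sketch of the from_generator route (the file iteration and cleaning below are hypothetical placeholders for your block-storage download and processing logic):

from datasets import Dataset

def gen():
    for path in ["part1.txt", "part2.txt"]:  # placeholder: download each file from block storage first
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield {"text": line.strip()}  # one cleaned row per dataset entry

# the generated examples are written to Arrow files in the local cache and memory-mapped
ds = Dataset.from_generator(gen)
ds.push_to_hub("316usman/my_text_dataset")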

The alternative approach looks promising; I will try that out too. In the alternative approach, would I be uploading text files (with each row in my file representing an entry in the dataset)?

It uploads Parquet files. Parquet is a compressed columnar format that is particularly suited to storing all sorts of datasets and is easy to load in Python 🙂
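For example, once pushed, the Parquet files can be loaded back, or even streamed row by row, with load_dataset:

from datasets import load_dataset

# streaming=True reads the Parquet shards lazily instead of downloading everything first
ds = load_dataset("316usman/my_text_dataset", split="train", streaming=True)
print(next(iter(ds)))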

Thank you so much, I'll try it out and get back to you.