Arrow table size increased by a factor of ~2

I have a dataset that consists of 50,000 examples, each of shape 125 × 64 with dtype float32 specified in the HF dataset features. When I save this chunk, the Arrow table is 3.7 GB. This is just 1 of 1,600 chunks, approximately 6 TB in total.

I would expect this to be significantly smaller: about 1.49 GB (50,000 × 125 × 64 × 4 bytes / 1024³). I'd expect some overhead associated with storing an Arrow table, but not almost twice the expected size. That would put the entire dataset at roughly 2.4 TB rather than 6 TB. I'm guessing there is some really low-hanging optimization fruit to be had here.
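For reference, a quick back-of-the-envelope check of the expected raw payload per chunk (plain arithmetic, no datasets API involved):

```python
num_examples = 50_000
elements_per_example = 125 * 64      # 8,000 float32 values per example
bytes_per_element = 4                # float32

expected_gib = num_examples * elements_per_example * bytes_per_element / 1024**3
print(f"{expected_gib:.2f} GiB")     # ~1.49 GiB, versus the 3.7 GB observed on disk
```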

The data is stored in GCS, and the (potentially) larger-than-needed file sizes roughly double the time it takes to load the data onto a VM. I could go the route of multiprocessing the downloads, but that requires more cores, and at the end of the day it's a wash: fewer cores means more time but lower cost, more cores means less time but higher cost.

**The fundamental question here, which I have seen asked many times across this forum and the GitHub issues, is: what is the best way to get TB-sized datasets that you created into the Hugging Face datasets ecosystem, when the goal is to access that dataset on a VM, as it almost always is?**

My current solution (for posterity, and looking for improvements):

  • Chunk the dataset (each chunk is the 3.7 GB Arrow table described above)
  • Iterate over a chunk and turn each example into a pa.array using TypedSequence(np.array(example), Array2D((125, 64), dtype='float32'))
  • Turn the pa.arrays into a pa.Table using pa.Table.from_arrays() from the PyArrow API
  • Pass those tables to the Dataset constructor and call save_to_disk (steps 2–4 are sketched in code after this list)
  • Upload the chunks to GCS
  • Download the chunks onto the VM (I found that using the GCS Python API to download/upload is much faster than load_from_disk("gcs://…"))
  • The full dataset is now on the VM's local filesystem; load it into the training script with:
chunked_datasets = [load_from_disk(p) for p in paths_to_chunks]
dataset = concatenate_datasets(chunked_datasets) 
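A rough sketch of steps 2–4, assuming each chunk is a list of (125, 64) float32 NumPy arrays; the helper name and the "data" column name are placeholders of mine, not part of the datasets API:

```python
import numpy as np
import pyarrow as pa
from datasets import Array2D, Dataset
from datasets.arrow_writer import TypedSequence

def build_chunk_dataset(examples):
    # Wrap the examples in a typed Arrow array carrying the 2D shape information.
    arr = pa.array(TypedSequence(examples, type=Array2D((125, 64), dtype="float32")))
    table = pa.Table.from_arrays([arr], names=["data"])
    return Dataset(table)

# Toy chunk of 4 examples; the real chunks hold 50,000.
examples = [np.zeros((125, 64), dtype=np.float32) for _ in range(4)]
ds = build_chunk_dataset(examples)
ds.save_to_disk("chunk_0000")
```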

Any further optimizations would be greatly appreciated!

Hi ! Currently the Array2D type uses a storage layout that is not optimized in terms of space: it stores tons of unnecessary offsets in the arrays. We'd like to change this and end up with the sizes you're expecting.

There is a discussion here and a PR here (though it was not finished) if you're interested in following the progress. If you'd like to contribute, I'd also be happy to give you some pointers :slight_smile:

Thanks for the response! Any thoughts on the answer to the bolded question?

I would be happy to contribute. I'll look further into what this would entail and would enjoy getting some pointers at that stage.

I think the best approach is to push to the Hugging Face Hub using my_dataset.push_to_hub("my_username/my_dataset_name"). It can even be saved as a private dataset if you pass private=True.

This way the dataset is saved as multiple compressed Parquet files (max 500 MB each by default), and you can reload it using load_dataset. It will be much faster than downloading the uncompressed Arrow data produced by save_to_disk.
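A minimal sketch, reusing the placeholder repo name from above; the toy data is just for illustration:

```python
from datasets import Dataset, load_dataset

# Stand-in for one of your chunked datasets.
ds = Dataset.from_dict({"data": [[0.0] * 64] * 10})

# Uploads as compressed Parquet shards; private=True keeps the repo private.
ds.push_to_hub("my_username/my_dataset_name", private=True)

# Reload later, e.g. on the training VM.
reloaded = load_dataset("my_username/my_dataset_name", split="train")
```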

And if you want, you can even reload the dataset in streaming mode (streaming=True) if you don't want to download everything, but instead download on the fly while iterating over the dataset. It's pretty convenient for datasets of this size.
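For example, streaming the same (hypothetical) repo so shards are fetched on the fly while iterating:

```python
from datasets import load_dataset

# Nothing is downloaded up front; shards are fetched as you iterate.
streamed = load_dataset("my_username/my_dataset_name", split="train", streaming=True)

for example in streamed:
    ...  # feed into the training loop
```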

(PS: If you still want to use save_to_disk, note that a PR is open and almost ready to merge that adds the num_shards= parameter to save_to_disk)

Greatly appreciate the help here!

The dataset is too large to have all on disk at once. What is the best way to push_to_hub incrementally? If I keep using the same push_to_hub function, it will upload the incremental data but then delete what was there before.

If this is not possible currently, would a PR that enables toggling the deletion of files in _push_parquet_shards_to_hub work?

Is there any sort of functionality that allows the next partition to be downloaded in the background so that training doesn't pause to download? This could very well already exist and I'm just missing it.

Appending to an existing dataset isn't implemented yet, though it's already possible to export a dataset to a Parquet file using .to_parquet() and then use the huggingface_hub package to upload the file to the Hub. You can upload multiple Parquet files and reload the full dataset using load_dataset.
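A minimal sketch of that flow, assuming a dataset repo my_username/my_dataset_name already exists; the data/chunk-*.parquet layout is a choice made here for illustration, not a requirement:

```python
from datasets import Dataset, load_dataset
from huggingface_hub import HfApi

api = HfApi()

# Export one chunk to Parquet and upload it under its own filename.
chunk = Dataset.from_dict({"data": [[0.0] * 64] * 10})  # placeholder chunk
chunk.to_parquet("chunk-0000.parquet")
api.upload_file(
    path_or_fileobj="chunk-0000.parquet",
    path_in_repo="data/chunk-0000.parquet",
    repo_id="my_username/my_dataset_name",
    repo_type="dataset",
)

# Later: reload all uploaded shards as a single dataset.
ds = load_dataset(
    "my_username/my_dataset_name",
    data_files="data/*.parquet",
    split="train",
)
```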

datasets itself has no native prefetching yet, but you can use a PyTorch DataLoader to enable prefetching on a streaming dataset :slight_smile:
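For example, a sketch that wraps a streamed dataset in a PyTorch DataLoader so background workers fetch upcoming batches during training (this assumes a datasets version where IterableDataset.with_format("torch") is available; the repo name and batch size are placeholders):

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

streamed = load_dataset("my_username/my_dataset_name", split="train", streaming=True)
streamed = streamed.with_format("torch")  # yield torch tensors

# num_workers > 0 lets the DataLoader prefetch batches in background workers;
# how shards are split across workers depends on the datasets version.
loader = DataLoader(streamed, batch_size=32, num_workers=2)

for batch in loader:
    ...  # training step
```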