I have a dataset chunk consisting of 50,000 examples, each a 125 x 64 array with dtype float32 specified in the HF dataset features. When I save it, the Arrow table is 3.7GB. This is just 1 of 1,600 chunks, so approximately 6TB total.
I would expect this to be significantly smaller: roughly 1.49GB (50,000 * 125 * 64 * 4 bytes / 1024**3). I'd expect there to be some overhead associated with storing an Arrow table, but not almost twice the expected size. At the expected size, the entire dataset should really be about 2.4TB. I'm guessing there is some really low-hanging optimization fruit to be had here.
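For reference, the back-of-the-envelope calculation behind that expected size:

# Expected raw size of one 50,000-example chunk of (125, 64) float32 arrays
n_examples = 50_000
bytes_per_example = 125 * 64 * 4            # 8,000 float32 values x 4 bytes each
expected_gib = n_examples * bytes_per_example / 1024**3
print(f"{expected_gib:.2f} GiB")            # ~1.49 GiB, vs. the 3.7GB Arrow file observed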
The data is being stored in GCS, and the (potentially) larger-than-needed file sizes theoretically double the amount of time it takes to load my data onto a VM. I could go the route of multiprocessing the downloads, but that requires more cores, and at the end of the day it's a wash (fewer cores = more time but cheaper; more cores = less time but more expensive).
The fundamental question here, which I have seen asked a ton across this forum and in GitHub issues, is: what is the best way to get TB-sized datasets that you created into the Hugging Face datasets ecosystem when the goal is to access that dataset on a VM, as it almost always is?
My current solution (for posterity, and looking for improvements; a rough code sketch of the chunk-writing steps follows this list):
Chunk the dataset (each chunk is the 3.7GB file described above)
Iterate over the chunks and turn each example into a "pa.array" using TypedSequence(np.array(example), Array2D((125, 64), dtype="float32"))
Turn the pa.arrays into a pa.Table using pa.Table.from_arrays() from the pyarrow API
Pass those tables to the Dataset constructor and save_to_disk
Upload the chunks to GCS
Download the chunks onto the VM (I found that using the GCS Python API to download/upload is much faster than load_from_disk("gcs://…"))
The full dataset is now in the VM's local file system; load it into the training script with:
chunked_datasets = [load_from_disk(p) for p in paths_to_chunks]
dataset = concatenate_datasets(chunked_datasets)
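Roughly, the chunk-writing steps above look something like this (the column name "data" and the batching per chunk are my choices; the details of my actual code differ slightly):

import numpy as np
import pyarrow as pa
from datasets import Array2D, Dataset
from datasets.arrow_writer import TypedSequence

def write_chunk(chunk_examples, out_dir):
    # chunk_examples: list of (125, 64) float32 numpy arrays for one chunk
    # TypedSequence makes pyarrow build the HF Array2D extension array
    arr = pa.array(TypedSequence(
        [np.asarray(ex, dtype=np.float32) for ex in chunk_examples],
        type=Array2D(shape=(125, 64), dtype="float32"),
    ))
    # single-column table; "data" is an arbitrary column name
    table = pa.Table.from_arrays([arr], names=["data"])
    Dataset(table).save_to_disk(out_dir)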
Any further optimizations would be greatly appreciated!
Hi ! Currently the Array2D type uses a storage layout that is not optimized in terms of space: it stores tons of unnecessary offsets in the arrays. We'd like to change this and end up with the sizes you're expecting.
There is a discussion here and a PR here (though it was not finished) if you're interested in following the advancements. If you'd like to contribute I'd also be happy to give you some pointers.
I think the best option is to push to the Hugging Face Hub using my_dataset.push_to_hub("my_username/my_dataset_name"). It can even be saved as a private dataset if you pass private=True.
This way the dataset is saved as multiple compressed parquet files (max 500MB each by default), and you can reload the dataset using load_dataset. It will be much faster than downloading uncompressed Arrow data written by save_to_disk.
And if you want, you can even reload the dataset in streaming mode (streaming=True) if you don't want to download everything, instead downloading on the fly while iterating over the dataset. It's pretty convenient for datasets of this size.
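For example (with my_dataset being the Dataset you built, and a placeholder repo name):

from datasets import load_dataset

# one-time upload: the dataset is stored as compressed parquet shards on the Hub
my_dataset.push_to_hub("my_username/my_dataset_name", private=True)

# on the training VM: download and cache the whole dataset...
ds = load_dataset("my_username/my_dataset_name", split="train")

# ...or stream it, fetching shards on the fly while iterating
ds_streamed = load_dataset("my_username/my_dataset_name", split="train", streaming=True)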
(PS: If you still want to use save_to_disk, note that a PR is open and almost ready to merge that adds the num_shards= parameter to save_to_disk)
The dataset is too large to have all on disk at once. What is the best way to "push_to_hub" incrementally? If I keep using the same push_to_hub fn, it will upload the incremental data but then delete what was there prior.
If this is not possible currently, would a PR that enables toggling the deletion of files in "_push_parquet_shards_to_hub" work?
Is there any sort of functionality that allows the next partition to be downloaded in the background so that training doesn't pause to download? This could very well already exist and I'm just missing it.
Appending to an existing dataset isn't implemented yet. However, it's already possible to export a dataset to a parquet file using .to_parquet() and then use the huggingface_hub package to upload the file to the Hub. You can upload multiple parquet files and reload the full dataset using load_dataset.
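A rough sketch of that incremental workflow (the repo name, file names, and chunk_dataset variable are placeholders):

from datasets import load_dataset
from huggingface_hub import HfApi

api = HfApi()

# export each chunk (a Dataset) to its own parquet file and upload it as a separate shard
chunk_dataset.to_parquet("chunk-00042.parquet")
api.upload_file(
    path_or_fileobj="chunk-00042.parquet",
    path_in_repo="data/chunk-00042.parquet",
    repo_id="my_username/my_dataset_name",
    repo_type="dataset",
)

# once all shards are uploaded, reload (or stream) the full dataset
ds = load_dataset("my_username/my_dataset_name")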
datasets itself has no native prefetching yet, but you can use a PyTorch DataLoader to enable prefetching on a streaming dataset.
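Something along these lines (the repo name, batch size, and worker counts are just examples):

from datasets import load_dataset
from torch.utils.data import DataLoader

ds = load_dataset("my_username/my_dataset_name", split="train", streaming=True)
ds = ds.with_format("torch")

# DataLoader workers iterate the streaming dataset in background processes,
# so upcoming shards download while the current batch is being trained on
loader = DataLoader(ds, batch_size=32, num_workers=4, prefetch_factor=2)

for batch in loader:
    pass  # training step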