I think the best option is to push to the Hugging Face Hub using `my_dataset.push_to_hub("my_username/my_dataset_name")`. It can even be saved as a private dataset if you pass `private=True`.
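For example, something like this (just a sketch with a toy dataset and a placeholder repo name; you need to be authenticated first, e.g. with `huggingface-cli login`):

```python
from datasets import Dataset

# Toy dataset just for illustration -- replace with your real dataset
my_dataset = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})

# Requires being logged in, e.g. via `huggingface-cli login`
my_dataset.push_to_hub("my_username/my_dataset_name")                # public repo
my_dataset.push_to_hub("my_username/my_dataset_name", private=True)  # private repo
```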
This way the dataset is saved as multiple compressed Parquet files (max 500MB each by default), and you can reload it using `load_dataset`. It will be much faster than downloading uncompressed Arrow data saved with `save_to_disk`.
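Reloading then looks something like this (same placeholder repo name; for a private dataset you also need to pass your token, via `use_auth_token=True` or `token=True` depending on your `datasets` version):

```python
from datasets import load_dataset

# Downloads the Parquet shards from the Hub and caches them locally
my_dataset = load_dataset("my_username/my_dataset_name")
print(my_dataset)
```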
And if you want, you can even reload the dataset in streaming mode (`streaming=True`) if you don't want to download everything, but instead download on the fly while iterating over the dataset. It's pretty convenient for datasets of this size.
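A minimal streaming sketch (assuming the dataset has a `train` split):

```python
from datasets import load_dataset

# Nothing is downloaded upfront: shards are read on the fly as you iterate
streamed = load_dataset("my_username/my_dataset_name", streaming=True)

for example in streamed["train"]:
    print(example)  # process one example at a time
    break
```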
(PS: if you still want to use `save_to_disk`, note that a PR is open and almost ready to merge that adds a `num_shards=` parameter to `save_to_disk`.)
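Once that parameter is available in your installed version of `datasets`, usage would presumably look something like this (path and shard count are arbitrary):

```python
# Assumes a `datasets` release that includes the num_shards= parameter from that PR
my_dataset.save_to_disk("path/to/local_dir", num_shards=8)
```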