I think the best option is to push to the Hugging Face Hub using `my_dataset.push_to_hub("my_username/my_dataset_name")`. It can even be saved as a private dataset if you pass `private=True`.
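For example, something like this (just a sketch with a toy dataset and a placeholder repo name; you need to be authenticated first, e.g. with `huggingface-cli login`):

```python
from datasets import Dataset

# Toy dataset just for illustration -- replace with your real dataset
my_dataset = Dataset.from_dict({"text": ["hello", "world"], "label": [0, 1]})

# Requires being logged in, e.g. via `huggingface-cli login`
my_dataset.push_to_hub("my_username/my_dataset_name")                # public repo
my_dataset.push_to_hub("my_username/my_dataset_name", private=True)  # private repo
```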
This way the dataset is saved as multiple compressed Parquet files (max 500MB each by default), and you can reload it using `load_dataset`. It will be much faster than downloading uncompressed Arrow data saved with `save_to_disk`.
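Reloading then looks something like this (same placeholder repo name; for a private dataset you also need to pass your token, via `use_auth_token=True` or `token=True` depending on your `datasets` version):

```python
from datasets import load_dataset

# Downloads the Parquet shards from the Hub and caches them locally
my_dataset = load_dataset("my_username/my_dataset_name")
print(my_dataset)
```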
And if you want, you can even reload the dataset in streaming mode (`streaming=True`) if you don't want to download everything, but instead download on the fly while iterating over the dataset. It's pretty convenient for datasets of this size.
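A minimal streaming sketch (assuming the dataset has a `train` split):

```python
from datasets import load_dataset

# Nothing is downloaded upfront: shards are read on the fly as you iterate
streamed = load_dataset("my_username/my_dataset_name", streaming=True)

for example in streamed["train"]:
    print(example)  # process one example at a time
    break
```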
(PS: if you still want to use `save_to_disk`, note that a PR is open and almost ready to merge that adds a `num_shards=` parameter to `save_to_disk`.)
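Once that parameter is available in your installed version of `datasets`, usage would presumably look something like this (path and shard count are arbitrary):

```python
# Assumes a `datasets` release that includes the num_shards= parameter from that PR
my_dataset.save_to_disk("path/to/local_dir", num_shards=8)
```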