Increased Arrow table size by a factor of ~2

I think the best option is to push the dataset to the Hugging Face Hub using `my_dataset.push_to_hub("my_username/my_dataset_name")`. It can even be saved as a private dataset if you pass `private=True`.
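A minimal sketch, where the toy dataset and the repo ID `my_username/my_dataset_name` are placeholders for your own:

```python
from datasets import Dataset

# Toy dataset just for illustration; use your own Dataset object.
my_dataset = Dataset.from_dict({"text": ["hello", "world"]})

# Uploads the dataset to the Hub as compressed Parquet shards.
# Requires authentication, e.g. via `huggingface-cli login`.
my_dataset.push_to_hub("my_username/my_dataset_name")

# Or keep the repository private:
my_dataset.push_to_hub("my_username/my_dataset_name", private=True)
```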

This way the dataset is saved as multiple compressed Parquet files (max 500MB each by default). And you can reload the dataset using `load_dataset`, which will be much faster than downloading the uncompressed Arrow data written by `save_to_disk`.
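Reloading then looks like this (same placeholder repo ID as above):

```python
from datasets import load_dataset

# Downloads and caches the Parquet shards from the Hub.
my_dataset = load_dataset("my_username/my_dataset_name", split="train")
```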

And if you want, you can even reload the dataset in streaming mode (`streaming=True`) if you don’t want to download everything, but instead download on the fly while iterating over the dataset. It’s pretty convenient for datasets of this size.
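For example:

```python
from datasets import load_dataset

# With streaming=True nothing is downloaded up front;
# examples are fetched on the fly as you iterate.
streamed = load_dataset("my_username/my_dataset_name", split="train", streaming=True)

for example in streamed:
    print(example)
    break  # just peek at the first example
```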

(PS: if you still want to use `save_to_disk`, note that a PR is open and almost ready to merge that adds a `num_shards=` parameter to `save_to_disk`.)
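Once that PR lands, the call would look something like this (hypothetical until it’s merged, and the shard count is illustrative):

```python
# num_shards= would control how many Arrow files save_to_disk writes.
my_dataset.save_to_disk("path/to/my_dataset", num_shards=8)
```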