Compressing, saving, and loading datasets

vblagoje · October 30, 2020, 12:15pm

Hello everyone,

I am working with large datasets (Wikipedia) and use the map transform to create new datasets. The workflow involves saving the new datasets with save_to_disk and then compressing the dataset folder with terminal compression utils. I then decompress these files on other machines and use load_from_disk to load them. These manual steps are a pain.
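For reference, the compress/decompress part of this manual workflow can be sketched in Python roughly as follows. This is only a minimal sketch using the standard library's tarfile module; the helper names compress_dataset_dir and extract_dataset_archive are hypothetical and not part of the datasets library.

```python
import tarfile
from pathlib import Path


def compress_dataset_dir(dataset_dir: str, archive_path: str) -> str:
    """Pack a folder written by save_to_disk into a single .tar.gz file.

    Hypothetical helper, not part of the datasets library.
    """
    src = Path(dataset_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps only the folder name, not the full local path
        tar.add(src, arcname=src.name)
    return archive_path


def extract_dataset_archive(archive_path: str, target_dir: str) -> str:
    """Unpack the archive and return the folder to pass to load_from_disk.

    Hypothetical helper, not part of the datasets library.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        # The first member is the top-level dataset folder added above
        top_level = tar.getnames()[0].split("/")[0]
        tar.extractall(target_dir)
    return str(Path(target_dir) / top_level)
```

On the first machine this would run after dataset.save_to_disk("wiki_processed") (the folder name is just an example), and on the second machine the path returned by extract_dataset_archive would be passed to load_from_disk.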

It would be great to have compression built into datasets, so that save_to_disk could produce a single compressed file if desired.
If I could save these datasets directly to a remote location, skipping the manual save_to_disk, compression, and copy steps, that would be amazing.
Loading these datasets via URL from s3, gs, etc. with a single load_dataset call would be a killer feature.

All the best,
Vladimir

lhoestq · October 30, 2020, 2:24pm

I agree, it would be super cool to be able to archive datasets and save/load them to/from cloud storage. We’re actively thinking about this. Do you think some dataset versioning logic could be interesting as well?

vblagoje · October 30, 2020, 4:09pm

Not important for me. But maybe I am overlooking it. How would you use versioning? Like git for datasets?

lhoestq · November 10, 2020, 3:58pm

Yes, maybe something similar to git lfs, directly integrated in the library. We’re already doing that for models in the transformers library (since today’s migration).
