Compressing, saving, and loading datasets

Hello everyone,

I am working with large datasets (Wikipedia) and use the map transform to create new datasets. I save the new datasets with save_to_disk and then compress the dataset folder with terminal compression utilities. On the other machines, I decompress these files and then use load_from_disk to load them. These manual steps are a PITA.
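
For reference, the manual round trip looks roughly like the sketch below. It is only an illustration: the helper names pack_dataset_dir and unpack_dataset_archive are made up here, the gzip-compressed tar format is an assumption, and the compression step uses only the Python standard library. The save_to_disk and load_from_disk calls appear in comments to show where they fit.

```python
import shutil
from pathlib import Path


def pack_dataset_dir(dataset_dir, archive_base):
    """Pack the contents of a saved dataset folder into <archive_base>.tar.gz.

    Hypothetical helper: dataset_dir is the folder written by save_to_disk,
    archive_base is the archive path without the .tar.gz suffix. Keep the
    archive outside dataset_dir so it does not end up inside itself.
    """
    archive_base = Path(archive_base)
    archive_base.parent.mkdir(parents=True, exist_ok=True)
    # root_dir makes the archive hold paths relative to the dataset folder.
    archive_path = shutil.make_archive(
        str(archive_base), "gztar", root_dir=str(dataset_dir)
    )
    return Path(archive_path)


def unpack_dataset_archive(archive_path, target_dir):
    """Unpack an archive made by pack_dataset_dir into target_dir.

    Returns target_dir, which then has the same layout as the original
    dataset folder and can be passed to load_from_disk.
    """
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.unpack_archive(str(archive_path), str(target_dir))
    return target_dir


# On the source machine (after dataset.save_to_disk("wiki_mapped")):
#     archive = pack_dataset_dir("wiki_mapped", "archives/wiki_mapped")
# On the target machine (after copying the archive over):
#     folder = unpack_dataset_archive("wiki_mapped.tar.gz", "wiki_mapped")
#     dataset = load_from_disk(str(folder))
```

Even wrapped in helpers like these, the copy between machines is still a separate manual step.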

  1. It would be great to use compression within datasets and have one compressed file as a result of save_to_disk if so desired.

  2. If I could save these datasets to a remote location immediately, bypassing the manual save_to_disk, compression, and copy steps, that would be amazing.

  3. Loading these created datasets via URL from s3, gs, etc. with a single load_dataset call would be a killer feature.

All the best,
Vladimir

I agree that it would be super cool to be able to archive datasets and save/load the archives to/from cloud storage. We’re thinking about this actively. Do you think that some dataset versioning logic could be interesting as well?

Not important for me. But maybe I am overlooking it. How would you use versioning? Like git for datasets?

Yes, maybe something similar to Git LFS, directly integrated in the library. We’re already doing that for models in the transformers library (since today’s migration).