Hello everyone,
I am working with large datasets (Wikipedia) and use the map transform to create new datasets. My workflow is to save each new dataset with save_to_disk and then compress the dataset folder with command-line compression utilities. To use a dataset on another machine, I decompress the archive there and load it with load_from_disk. These manual steps are a PITA.
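For reference, here is a rough sketch of the manual compress/decompress steps I currently do around save_to_disk and load_from_disk, written with Python's standard tarfile module instead of terminal tools. The helper names compress_dir and extract_archive and all paths are made up for illustration; they are not part of datasets.

```python
import tarfile
from pathlib import Path


def compress_dir(src_dir, archive_path):
    """Pack a directory (e.g. one written by save_to_disk) into a .tar.gz file."""
    src = Path(src_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname stores only the folder name, not the full local path
        tar.add(str(src), arcname=src.name)
    return Path(archive_path)


def extract_archive(archive_path, dest_dir):
    """Unpack the archive into dest_dir and return the restored folder path.

    Only use this on archives you created yourself.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive_path), "r:gz") as tar:
        # The first member is the top-level dataset folder added above
        top_level = tar.getnames()[0].split("/")[0]
        tar.extractall(str(dest))
    return dest / top_level


# Hypothetical usage with the datasets library (paths are placeholders):
#   dataset.save_to_disk("wiki_processed")
#   compress_dir("wiki_processed", "wiki_processed.tar.gz")
#   ... copy the archive to the other machine ...
#   restored = extract_archive("wiki_processed.tar.gz", "data")
#   dataset = load_from_disk(str(restored))
```

This works, but it is exactly the kind of boilerplate I would like the library to handle for me.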
- It would be great to have compression built into datasets, so that save_to_disk could optionally produce a single compressed file.
- It would be amazing if I could save these datasets directly to a remote location, skipping the manual save_to_disk, compress, and copy steps.
- Loading these datasets from a URL (s3, gs, etc.) with a single load_dataset call would be a killer feature.
All the best,
Vladimir