Streaming in dataset uploads

Hi all, first time posting, long time user. I'm processing a large (6 TB) dataset from the Hub via streaming. I'd like to upload the resulting dataset through something like streaming as well. Is this possible? My current workaround is pushing hundreds of small parts instead of the usual train/val/test splits, which obviously isn't ideal.
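For context, my current workflow looks roughly like this (just a sketch; the repo names and the `process()` transform are placeholders):

```python
from datasets import load_dataset, Dataset
from huggingface_hub import HfApi

def process(example):
    # placeholder for my actual per-example transform
    return example

api = HfApi()
stream = load_dataset("org/source-dataset", split="train", streaming=True)

buffer, shard_idx = [], 0
for example in stream.map(process):
    buffer.append(example)
    if len(buffer) >= 100_000:  # flush a shard every 100k rows
        Dataset.from_list(buffer).to_parquet(f"part-{shard_idx:05d}.parquet")
        api.upload_file(
            path_or_fileobj=f"part-{shard_idx:05d}.parquet",
            path_in_repo=f"data/part-{shard_idx:05d}.parquet",
            repo_id="org/processed-dataset",
            repo_type="dataset",
        )
        buffer, shard_idx = [], shard_idx + 1
# (the last partial shard is flushed the same way, omitted here)
```

So the data ends up on the Hub as hundreds of `data/part-*.parquet` files rather than proper splits.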

If anyone has come across this issue before, I’d love to hear about the solution you found!


With 6 TB it's not impossible to download, but it's certainly better to be able to handle it with streaming…
If you don't want to change the contents of the dataset much, you can write a loading script (a builder class) for the dataset and upload it to the repo alongside the data files. Users can then load it by passing trust_remote_code=True.
Also, if the data consists of media files and the like, the WebDataset approach may be a good fit for uploading in shards. @lhoestq
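A minimal sketch of such a loading script, assuming Parquet shards under data/ and a single text column (the file layout, feature schema, and shard count are placeholders):

```python
# my_dataset.py -- loading script uploaded to the dataset repo next to the data files
import datasets
import pyarrow.parquet as pq

class MyDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"text": datasets.Value("string")})
        )

    def _split_generators(self, dl_manager):
        # list the shard files for each split; dl_manager resolves/downloads them
        urls = {"train": [f"data/part-{i:05d}.parquet" for i in range(3)]}
        files = dl_manager.download(urls)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"files": files["train"]},
            )
        ]

    def _generate_examples(self, files):
        key = 0
        for path in files:
            table = pq.read_table(path)
            for row in table.to_pylist():
                yield key, {"text": row["text"]}
                key += 1
```

It could then be loaded (and streamed) with something like load_dataset("org/processed-dataset", trust_remote_code=True, streaming=True).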

Loading large dataset

Saving large dataset

Building dataset

WebDataset

Troubleshooting for dataset

Hi! Currently IterableDataset is still missing a push_to_hub() method, which would be very welcome (it's open to contributions).

Note that, in the meantime, you can still merge your small parts into train/val/test splits in the YAML part of the README.md file.
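For example, something along these lines in the README.md YAML front matter (the glob patterns are placeholders for however your part files are named):

```yaml
configs:
  - config_name: default
    data_files:
      - split: train
        path: data/train-part-*.parquet
      - split: validation
        path: data/val-part-*.parquet
      - split: test
        path: data/test-part-*.parquet
```

With this, load_dataset() sees the usual train/validation/test splits even though the repo itself contains many small shard files.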
