Streaming in dataset uploads

At 6TB, downloading the whole dataset is not impossible, but it is certainly better to be able to stream it.
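For reference, here is a minimal sketch of what streaming looks like with the `datasets` library. The repo id `your-username/your-dataset` and the helper name `stream_first_examples` are placeholders I made up, and the split name `train` is an assumption about how the repo is laid out.

```python
from itertools import islice


def stream_first_examples(repo_id, n=3, split="train"):
    """Return the first n examples without downloading the full dataset."""
    # Imported lazily so the helper can be defined without `datasets` installed.
    from datasets import load_dataset

    # streaming=True returns an IterableDataset that fetches data on the fly
    # instead of materializing everything on local disk first.
    dataset = load_dataset(repo_id, split=split, streaming=True)
    return list(islice(dataset, n))


if __name__ == "__main__":
    # Hypothetical repo id; replace with your own dataset.
    for example in stream_first_examples("your-username/your-dataset"):
        print(example)
```

Only the examples you actually iterate over are fetched, which is the main point when the full dataset is several terabytes.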
If you don’t want to change the contents of the dataset much, you can write a loading script (or builder class) for the dataset and upload it to the repo. You can then load the dataset by passing trust_remote_code=True.
Also, if you are uploading media files or similar, WebDataset may be a good fit, since it lets you split the data into shards and upload them piece by piece. @lhoestq
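To illustrate the sharding idea, here is a rough sketch that packs samples into WebDataset-style tar shards using only the standard library. In that layout, files sharing the same basename (for example `sample0000.jpg` and `sample0000.txt`) form one sample. The function name `write_shards`, the `train-00000.tar` naming pattern, and the shard size are my own choices for illustration, not anything prescribed for your dataset.

```python
import io
import tarfile
from pathlib import Path


def write_shards(samples, out_dir, samples_per_shard=1000, prefix="train"):
    """Write samples into numbered tar shards and return the shard paths.

    `samples` is an iterable of (key, files) pairs, where `files` maps a
    file extension to the raw bytes for that part of the sample.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_paths = []
    tar = None
    for index, (key, files) in enumerate(samples):
        # Start a new shard every `samples_per_shard` samples.
        if index % samples_per_shard == 0:
            if tar is not None:
                tar.close()
            path = out_dir / f"{prefix}-{len(shard_paths):05d}.tar"
            tar = tarfile.open(path, "w")
            shard_paths.append(path)
        # Files with the same basename are grouped as one sample.
        for extension, payload in files.items():
            info = tarfile.TarInfo(name=f"{key}.{extension}")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    if tar is not None:
        tar.close()
    return shard_paths
```

Because `samples` can be a generator, the full dataset never has to sit in memory, and each finished shard can be uploaded on its own (for example with `upload_folder` from `huggingface_hub`), so an interrupted upload does not have to restart from zero.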

Loading large dataset

Saving large dataset

Building dataset

WebDataset

Troubleshooting for dataset