Standard way to upload huge dataset

I have a huge (100GB+) dataset of audio (.wav files) with its respective metadata. I was able to load it easily with load_dataset and upload it with push_to_hub, which converts it to Parquet files. What is the best way to upload such a large dataset (particularly images and audio)? I want to be able to use streaming with it, and to update the metadata without having to re-upload the entire dataset.
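For reference, here is a minimal sketch of the workflow I'm describing (the data path and repo id are placeholders):

```python
from datasets import load_dataset

# Load the local .wav files (plus metadata) and push them to the Hub.
ds = load_dataset("audiofolder", data_dir="path/to/audio_dataset")
ds.push_to_hub("username/my-audio-dataset")  # stored as Parquet shards on the Hub

# Parquet datasets on the Hub can then be streamed without a full download:
streamed = load_dataset("username/my-audio-dataset", streaming=True, split="train")
print(next(iter(streamed)))
```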

cc @lhoestq @mariosasko

Parquet is generally for frozen datasets. If you wish to modify metadata, you can structure your dataset as an ImageFolder with metadata; see the docs at Image Dataset.
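Since your data is audio, the AudioFolder loader works the same way as ImageFolder. A minimal sketch of the layout, assuming a metadata.csv with a `file_name` column and placeholder filenames:

```python
# Expected repository/folder layout:
#
#   my_dataset/
#   ├── metadata.csv        # has a `file_name` column plus any metadata columns
#   ├── audio_0001.wav
#   ├── audio_0002.wav
#   └── ...
#
# The `file_name` values are paths relative to metadata.csv.
from datasets import load_dataset

ds = load_dataset("audiofolder", data_dir="my_dataset")
print(ds["train"][0])  # audio column + your metadata columns
```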


OK, I didn't know the metadata paths could be relative, so yes, this works.
But when ds.push_to_hub(…) is used, it converts the dataset to Parquet. I have no problem with the image/audio being converted to that format, but I would like to be able to modify the metadata even after the dataset has been uploaded to the Hub.

Is there any way to make that happen?
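One hedged option (an assumption on my side, not a push_to_hub feature): if the dataset is kept on the Hub in the raw AudioFolder layout (audio files + metadata.csv) rather than Parquet, the metadata file can be replaced on its own with huggingface_hub, leaving the audio files untouched. Repo id and paths below are placeholders:

```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_file(
    path_or_fileobj="metadata.csv",       # metadata edited locally
    path_in_repo="metadata.csv",          # overwrites only this file in the repo
    repo_id="username/my-audio-dataset",  # placeholder repo id
    repo_type="dataset",
    commit_message="Update metadata only",
)
```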

Another question: will I be able to use streaming with .tar files?

You can stream .tar files if they are in the WebDataset format; see the WebDataset docs on HF.
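A minimal sketch of streaming WebDataset-style shards, assuming the tars follow the WebDataset convention (one key per sample with paired extensions, e.g. 0001.wav and 0001.json) and a placeholder path:

```python
from datasets import load_dataset

ds = load_dataset(
    "webdataset",
    data_files={"train": "path/to/shards/*.tar"},  # placeholder glob of .tar shards
    split="train",
    streaming=True,  # iterate samples without extracting the archives locally
)
print(next(iter(ds)))
```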