I have a huge (100 GB+) dataset of audio (.wav files) and its respective metadata. I was able to load the dataset easily with load_dataset and upload it with push_to_hub, which converts it to Parquet files.

What is the best way to upload such a large dataset (particularly images and audio)? I want to be able to use streaming with it, and to update the metadata without having to re-upload the entire dataset.
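For reference, this is roughly what my current workflow looks like (the paths, repo name, and sampling rate below are just placeholders):

```python
from datasets import load_dataset, Audio

# Load local .wav files plus a metadata.csv via the audiofolder builder.
ds = load_dataset("audiofolder", data_dir="my_audio_dataset")

# Optional: decode/resample the audio column.
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# Pushing to the Hub converts the shards to Parquet.
ds.push_to_hub("my-username/my-audio-dataset")
```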
Parquet is generally for frozen datasets. If you want to be able to modify the metadata, you can structure your dataset as an ImageFolder with a metadata file instead; see the Image Dataset docs.
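For audio, the equivalent is the AudioFolder builder: a metadata.csv sits next to the audio files and links each row to a file through a relative path in its file_name column. A minimal sketch, with made-up file names and columns:

```python
from datasets import load_dataset

# Assumed layout (names are illustrative):
#
# my_dataset/
# ├── metadata.csv     # file_name,transcription
# │                    # data/first.wav,hello world
# │                    # data/second.wav,goodbye
# └── data/
#     ├── first.wav
#     └── second.wav
#
# AudioFolder reads metadata.csv and attaches each row's extra columns
# to the matching audio file.
ds = load_dataset("audiofolder", data_dir="my_dataset")
print(ds["train"][0])  # {'audio': {...}, 'transcription': 'hello world'}
```

If you upload the folder as-is (e.g. with huggingface_hub) rather than pushing Parquet, the metadata stays a plain CSV (or JSONL) file in the repo, so you can edit and re-upload just that file later.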
OK, I didn’t know the metadata paths could be relative, so yes, this works.

However, when ds.push_to_hub(…) is used, it converts the dataset into Parquet. I have no problem with the image/audio data being converted to that format, but I would like to be able to modify the metadata even after the dataset has been uploaded to the Hub.
Is there any way to make that happen?
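For context, the kind of update I'm hoping for is overwriting a single metadata file in the repo rather than re-pushing the converted Parquet shards. Something like this sketch, which assumes the repo keeps the raw audio files plus a metadata.csv (the repo id is a placeholder):

```python
from huggingface_hub import HfApi

api = HfApi()

# Overwrite only metadata.csv in the dataset repo; the audio files stay untouched.
api.upload_file(
    path_or_fileobj="metadata.csv",
    path_in_repo="metadata.csv",
    repo_id="my-username/my-audio-dataset",
    repo_type="dataset",
    commit_message="Update metadata only",
)
```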
Another question: will I be able to use streaming with .tar files?
You can stream .tar files if they are in the WebDataset format; see the WebDataset docs on HF.
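A minimal streaming sketch, assuming WebDataset-style shards where each sample is a group of files sharing a key (e.g. 0001.wav plus 0001.json inside train-000.tar); the shard paths below are placeholders:

```python
from datasets import load_dataset

# Stream the .tar shards without downloading everything up front.
ds = load_dataset(
    "webdataset",
    data_files={"train": "shards/train-*.tar"},  # local shards or files in a Hub dataset repo
    split="train",
    streaming=True,
)

for example in ds.take(2):
    # Fields typically include '__key__', '__url__', and one entry per file
    # extension, e.g. 'wav' and 'json'.
    print(example["__key__"], list(example.keys()))
```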