Create the refs/convert/parquet branch of a script-based dataset to get the viewer

Hello,

I plan to release a dataset (Metacreation/GigaMIDI, currently gated) which needs to be loaded from a script.
For compatibility reasons, I want the main branch to contain the data as shards of tar files (WebDataset format) along with JSON files for the metadata that will be loaded by the script, like the voxpopuli dataset.

I can’t get the parquet converter bot to automatically create the parquet version, and the parquet conversion script from the CLI doesn’t do what I want.
I tried to do this manually by converting the splits to parquet files and uploading them to the dedicated branch, but it’s impossible to create the refs/convert/parquet branch myself.

What can I do to make this possible? There is nothing in the docs to cover this use case.
How did you manage to create the parquet branches for the voxpopuli or mozilla-foundation/common_voice_11_0 datasets?

cc @albertvillanova

Bumping this :slight_smile: @severo @albertvillanova
I feel that it’s just a matter of a command to run, or something minor that I can’t put my finger on.
Hopefully the docs can be updated accordingly.

@albertvillanova :wave:
Do you have an idea of why the parquet conversion does not occur?
Should the dataset not be script-based at all? (i.e. already in parquet, as it’s a multimodal dataset)

cc @lhoestq @albertvillanova

FYI, I ended up manually converting the dataset to parquet and releasing it as is. At least everything (viewer, streaming…) works OOTB. The bot actually converted it to parquet anyway this time, even though it’s already in parquet.

The reason I initially wanted to release it as WebDataset zip files + script was to allow users outside of the HF datasets library ecosystem to just download the files and use them with whatever library they want. These users would now just have to iterate over the Dataset and write the bytes in each row to files. No big deal, just thought the feedback could be valuable to you. :v:
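For anyone reading along, that extraction step is short; here is a minimal sketch, assuming each row exposes the raw file bytes under a column such as `audio_bytes` plus a `filename` column (both names are illustrative, not the actual GigaMIDI schema):

```python
from pathlib import Path

def export_rows(rows, out_dir):
    """Write the raw bytes of each row back to individual files.

    `rows` is any iterable of dicts, e.g. a datasets.Dataset loaded
    from the parquet shards; the column names here are illustrative.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for row in rows:
        (out / row["filename"]).write_bytes(row["audio_bytes"])

# Plain dicts standing in for dataset rows:
rows = [{"filename": "a.mid", "audio_bytes": b"MThd..."}]
export_rows(rows, "exported")
```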


Hi Natooz! That’s great to hear. Parquet is a nice option to allow people to load the data via pandas/dask/spark/datasets/duckdb and many other libraries.

Also FYI, with WebDataset you could also get the viewer and streaming OOTB, but not with a loading script (we stopped supporting those for obvious security reasons). Just wondering why WebDataset alone couldn’t be a good fit?

The dataset would have been composed of the WebDataset part (zip files containing music files) and separate JSON files for the associated metadata, thus needing a script to build it, unless I am missing something.

In WebDataset you can group the audio data and the associated metadata together, see the docs here: WebDataset. You just need one metadata file per audio file, using the same filename prefix!

For example each sample can be one .wav and one .json, and the webdataset library will load them together to yield dictionaries like {"wav": (waveform, sampling_rate), "json": metadata}
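The grouping convention is just shared filename prefixes inside a tar shard; here is a minimal sketch using only the standard library (the sample names and JSON fields are made up for illustration):

```python
import io
import json
import tarfile

# Build a tiny WebDataset-style shard: each sample is a group of files
# sharing the same prefix ("sample_000"), distinguished by extension.
with tarfile.open("shard-000.tar", "w") as tar:
    for name, payload in [
        ("sample_000.wav", b"RIFF...fake-audio-bytes..."),
        ("sample_000.json", json.dumps({"artist": "unknown"}).encode()),
    ]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Readers such as the webdataset library group members by prefix and
# yield one dict per sample keyed by extension, e.g. {"wav": ..., "json": ...}.
```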

Didn’t know that, thank you!
I somehow missed this part of the docs (or have just been lazy copying existing datasets from the hub :sweat_smile:).
I hope it will help people reading this thread!