Create the refs/convert/parquet branch of a script-based dataset to get the viewer

Hello,

I plan to release a dataset (Metacreation/GigaMIDI, currently gated) which needs to be loaded from a script.
For compatibility reasons, I want the main branch to contain the data as shards of tar files (WebDataset format) along with JSON files for the metadata that will be loaded by the script, like the voxpopuli dataset.

I can’t get the parquet converter bot to automatically create the parquet version, and the parquet conversion script from the CLI doesn’t do what I want.
I tried to do this manually by converting the splits to parquet files and uploading them to the dedicated branch, but it’s impossible to create the refs/convert/parquet branch myself.

What can I do to make this possible? There is nothing in the docs to cover this use case.
How did you manage to create the parquet branches for the voxpopuli or mozilla-foundation/common_voice_11_0 datasets?

cc @albertvillanova

Bumping this :slight_smile: @severo @albertvillanova
I feel that it’s just a matter of a command to run, or something minor that I can’t put my finger on.
Hopefully the docs can be updated accordingly.

@albertvillanova :wave:
Do you have an idea of why the parquet conversion does not occur?
Should the dataset not be script-based at all? (i.e. already in parquet, as it’s a multimodal dataset)

cc @lhoestq @albertvillanova

FYI, I ended up manually converting the dataset to parquet and releasing it as is. At least everything (viewer, streaming…) works OOTB. The bot actually converted it to parquet anyway this time, even though it’s already in parquet.

The reason I initially wanted to release it as WebDataset zip files + script was to allow users outside of the HF datasets library ecosystem to just download the files and use them with whatever library they want. These users would now just have to iterate over the Dataset and write the bytes in each row to files. No big deal, just thought the feedback could be valuable to you. :v:
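For anyone reading along, that extraction step is short; here is a minimal sketch, assuming each row exposes the raw file bytes under a column such as `audio_bytes` plus a `filename` column (both names are illustrative, not the actual GigaMIDI schema):

```python
from pathlib import Path

def export_rows(rows, out_dir):
    """Write the raw bytes of each row back to individual files.

    `rows` is any iterable of dicts, e.g. a datasets.Dataset loaded
    from the parquet shards; the column names here are illustrative.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for row in rows:
        (out / row["filename"]).write_bytes(row["audio_bytes"])

# Plain dicts standing in for dataset rows:
rows = [{"filename": "a.mid", "audio_bytes": b"MThd..."}]
export_rows(rows, "exported")
```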


Hi Natooz! That’s great to hear. Parquet is a nice option to allow people to load the data via pandas/dask/spark/datasets/duckdb and many other libraries.

Also FYI, with WebDataset you could also get the viewer and streaming OOTB, but not with a loading script (we stopped supporting those for obvious security reasons). Just wondering why WebDataset alone couldn’t be a good fit?

The dataset would have been composed of the WebDataset part (zip files containing music files) and separate JSON files for the associated metadata, thus needing a script to build it, unless I am missing something.

In WebDataset you can group the audio data and the associated metadata together, see the docs here: WebDataset. You just need one metadata file per audio file, using the same filename prefix!

For example each sample can be one .wav and one .json, and the webdataset library will load them together to yield dictionaries like {"wav": (waveform, sampling_rate), "json": metadata}
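The grouping convention is just shared filename prefixes inside a tar shard; here is a minimal sketch using only the standard library (the sample names and JSON fields are made up for illustration):

```python
import io
import json
import tarfile

# Build a tiny WebDataset-style shard: each sample is a group of files
# sharing the same prefix ("sample_000"), distinguished by extension.
with tarfile.open("shard-000.tar", "w") as tar:
    for name, payload in [
        ("sample_000.wav", b"RIFF...fake-audio-bytes..."),
        ("sample_000.json", json.dumps({"artist": "unknown"}).encode()),
    ]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))

# Readers such as the webdataset library group members by prefix and
# yield one dict per sample keyed by extension, e.g. {"wav": ..., "json": ...}.
```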

Didn’t know that, thank you!
I somehow missed this part of the docs (or have just been lazy copying existing datasets from the hub :sweat_smile:).
I hope it will help people reading this thread!