Create dataset with data stored in Zenodo

Hi,

I’m planning to publish a dataset on Zenodo and I would really like to have a connector that easily integrates the dataset on the Hub. I’ve seen several datasets that actually store their data elsewhere (e.g., Zenodo or external websites), librispeech_asr or speech_commands just to name a few. They usually have a “dataset_name.py” file that is in charge of downloading and arranging the data. Are there any guidelines or instructions to follow on how to do this?

What is the best way to write such a script? My dataset contains audio files and JSON files for train, validation, and test splits. How can I ensure that the dataset can be loaded using the load_dataset function from the datasets library?
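
For context, this is roughly how I’d like the dataset to be usable in the end (the repo id below is just a placeholder):

```python
from datasets import load_dataset

# "my-org/my-audio-dataset" is a placeholder repo id
ds = load_dataset("my-org/my-audio-dataset")
print(ds)                       # DatasetDict with train/validation/test splits
print(ds["train"][0]["audio"])  # decoded audio: path, array, sampling_rate
```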

Hi! This doc explains how to create a dataset loading script, and for audio loading script examples, you can filter the existing audio dataset repos on the Hub and inspect their contents.
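
To give you an idea, here is a minimal sketch of what such a loading script could look like, assuming one audio archive plus one JSON metadata file per split hosted on Zenodo. All URLs, file names, the sampling rate, and the JSON schema below are placeholders you’d adapt to your record:

```python
import json
import os

import datasets

# Placeholder URLs; point these at the files of your Zenodo record
_URLS = {
    "audio": "https://zenodo.org/record/<record_id>/files/audio.tar.gz",
    "train": "https://zenodo.org/record/<record_id>/files/train.json",
    "validation": "https://zenodo.org/record/<record_id>/files/validation.json",
    "test": "https://zenodo.org/record/<record_id>/files/test.json",
}


class MyAudioDataset(datasets.GeneratorBasedBuilder):
    """Hypothetical loading script for an audio dataset hosted on Zenodo."""

    VERSION = datasets.Version("1.0.0")

    def _info(self):
        return datasets.DatasetInfo(
            description="Audio dataset with one JSON metadata file per split.",
            features=datasets.Features(
                {
                    "audio": datasets.Audio(sampling_rate=16_000),
                    "transcription": datasets.Value("string"),
                }
            ),
        )

    def _split_generators(self, dl_manager):
        # dl_manager downloads (and caches) the remote files
        audio_dir = dl_manager.download_and_extract(_URLS["audio"])
        metadata = dl_manager.download(
            {split: _URLS[split] for split in ("train", "validation", "test")}
        )
        return [
            datasets.SplitGenerator(
                name=split_enum,
                gen_kwargs={"audio_dir": audio_dir, "metadata_path": metadata[split]},
            )
            for split, split_enum in (
                ("train", datasets.Split.TRAIN),
                ("validation", datasets.Split.VALIDATION),
                ("test", datasets.Split.TEST),
            )
        ]

    def _generate_examples(self, audio_dir, metadata_path):
        # Assumes each JSON file is a list of {"file": ..., "transcription": ...}
        with open(metadata_path, encoding="utf-8") as f:
            records = json.load(f)
        for idx, record in enumerate(records):
            yield idx, {
                "audio": os.path.join(audio_dir, record["file"]),
                "transcription": record["transcription"],
            }
```

With a script like this in the dataset repo, `load_dataset("<namespace>/<dataset_name>")` will download the Zenodo files, cache them, and build the three splits.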

PS: Zenodo is not a great host (downloading can be slow in some scenarios), so consider uploading the data to the Hub instead.
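
For what it’s worth, if hosting on the Hub turns out to be an option, uploading the raw files can be done with a few lines of `huggingface_hub` (the repo id and local folder below are placeholders):

```python
from huggingface_hub import HfApi

api = HfApi()
# Create the dataset repo if it doesn't exist yet (placeholder repo id)
api.create_repo("my-org/my-audio-dataset", repo_type="dataset", exist_ok=True)
# Upload the local data folder (audio files + JSON metadata) to the repo
api.upload_folder(
    folder_path="path/to/local/dataset",
    repo_id="my-org/my-audio-dataset",
    repo_type="dataset",
)
```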


Thank you for the quick reply. I will follow the suggested steps.

Unfortunately, we cannot store data outside Europe (Zenodo is accepted).