Create dataset with data stored in Zenodo

Hi,

I’m planning to publish a dataset on Zenodo and I would really like to have a connector that easily integrates the dataset on the Hub. I’ve seen several datasets that actually store their data elsewhere (e.g., Zenodo or external websites), librispeech_asr or speech_commands just to name a few. They usually have a “dataset_name.py” file that is in charge of downloading and arranging the data. Are there any guidelines or instructions to follow on how to do this?

What is the best way to write such a script? My dataset contains audio files and JSON files for train, validation, and test splits. How can I ensure that the dataset can be loaded using the load_dataset function from the datasets library?
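
For context, this is roughly how I’d like the dataset to be usable in the end (the repo id below is just a placeholder):

```python
from datasets import load_dataset

# "my-org/my-audio-dataset" is a placeholder repo id
ds = load_dataset("my-org/my-audio-dataset")
print(ds)                       # DatasetDict with train/validation/test splits
print(ds["train"][0]["audio"])  # decoded audio: path, array, sampling_rate
```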

Hi! This doc explains how to create a dataset loading script, and for audio loading script examples, you can filter the existing audio dataset repos on the Hub and inspect their contents.
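
To give you an idea, here is a minimal sketch of what such a loading script could look like, assuming one audio archive plus one JSON metadata file per split hosted on Zenodo. All URLs, file names, the sampling rate, and the JSON schema below are placeholders you’d adapt to your record:

```python
import json
import os

import datasets

# Placeholder URLs; point these at the files of your Zenodo record
_URLS = {
    "audio": "https://zenodo.org/record/<record_id>/files/audio.tar.gz",
    "train": "https://zenodo.org/record/<record_id>/files/train.json",
    "validation": "https://zenodo.org/record/<record_id>/files/validation.json",
    "test": "https://zenodo.org/record/<record_id>/files/test.json",
}


class MyAudioDataset(datasets.GeneratorBasedBuilder):
    """Hypothetical loading script for an audio dataset hosted on Zenodo."""

    VERSION = datasets.Version("1.0.0")

    def _info(self):
        return datasets.DatasetInfo(
            description="Audio dataset with one JSON metadata file per split.",
            features=datasets.Features(
                {
                    "audio": datasets.Audio(sampling_rate=16_000),
                    "transcription": datasets.Value("string"),
                }
            ),
        )

    def _split_generators(self, dl_manager):
        # dl_manager downloads (and caches) the remote files
        audio_dir = dl_manager.download_and_extract(_URLS["audio"])
        metadata = dl_manager.download(
            {split: _URLS[split] for split in ("train", "validation", "test")}
        )
        return [
            datasets.SplitGenerator(
                name=split_enum,
                gen_kwargs={"audio_dir": audio_dir, "metadata_path": metadata[split]},
            )
            for split, split_enum in (
                ("train", datasets.Split.TRAIN),
                ("validation", datasets.Split.VALIDATION),
                ("test", datasets.Split.TEST),
            )
        ]

    def _generate_examples(self, audio_dir, metadata_path):
        # Assumes each JSON file is a list of {"file": ..., "transcription": ...}
        with open(metadata_path, encoding="utf-8") as f:
            records = json.load(f)
        for idx, record in enumerate(records):
            yield idx, {
                "audio": os.path.join(audio_dir, record["file"]),
                "transcription": record["transcription"],
            }
```

With a script like this in the dataset repo, `load_dataset("<namespace>/<dataset_name>")` will download the Zenodo files, cache them, and build the three splits.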

PS: Zenodo is not a great host (downloading can be slow in some scenarios), so consider uploading the data to the Hub instead.
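
For what it’s worth, if hosting on the Hub turns out to be an option, uploading the raw files can be done with a few lines of `huggingface_hub` (the repo id and local folder below are placeholders):

```python
from huggingface_hub import HfApi

api = HfApi()
# Create the dataset repo if it doesn't exist yet (placeholder repo id)
api.create_repo("my-org/my-audio-dataset", repo_type="dataset", exist_ok=True)
# Upload the local data folder (audio files + JSON metadata) to the repo
api.upload_folder(
    folder_path="path/to/local/dataset",
    repo_id="my-org/my-audio-dataset",
    repo_type="dataset",
)
```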


Thank you for the quick reply. I will follow the suggested steps.

Unfortunately, we cannot store data outside Europe (Zenodo is accepted).