Hi,
I’m planning to publish a dataset on Zenodo, and I’d really like a connector that easily integrates the dataset with the Hub. I’ve seen several datasets that actually store their data elsewhere (e.g., on Zenodo or external websites), such as librispeech_asr or speech_commands, to name a few. I’ve noticed there is usually a “dataset_name.py” file in charge of downloading and arranging the data. Are there any guidelines or instructions on how to write one?
What is the best way to write such a script? My dataset contains audio files plus JSON files for the train, validation, and test splits. How can I ensure the dataset can be loaded with the load_dataset function from the datasets library?