How does one actually create a new dataset?

I am relatively new to the ML scene, at least as far as creating datasets and training models goes. I have thousands of hours of audio that is perfectly transcribed, and I want to make a Hugging Face dataset from it. I have followed the official tutorials on writing a loading script (dataset script) and on making a dataset card, but they seem to address questions I would have later in the process.

Currently, I have close to 250k audio files stored locally and their respective transcriptions. How do I get from where I am now, to a public dataset that people can use to train their models? Specifically:

  1. Obviously, my audio files and my transcripts have to be available publicly somewhere (an S3 bucket, for example). Do I store all the audio files unpacked, unzipped, and uncompressed? Or am I supposed to zip them up? If so, how does the dataset card access files individually? Looking at a popular HF dataset, librispeech, the dataset card seems to be organized like a database, with access to each audio file and its respective transcript. How did they do that, when their DL_URLS seem to point to zipped-up files?

  2. The librispeech dataset card, as I mentioned, looks a lot like a database to me. How do I create entries in this “database”? How do I actually make an entry for each audio file with its respective transcription and other relevant metadata?

  3. Once I create this “database”, how do I actually deploy/publish it? Looking at librispeech’s git repo, I don’t see where this is done, and I can’t find any documentation on how to do it. It makes sense that it isn’t in the git repo, but, you know, I’m just a bit lost.

I hope my questions reveal the “theme” and spirit of my overall question: how do I actually create and publish a complete (audio) dataset?

Of course, questions are encouraged if you don’t understand what I’m trying to do.

Hello! :slight_smile:

Great question! It sent me on an exploration of how to do this myself, so this advice is not official, but I hope it’s helpful:

  1. Go through Chapter 5 of the Hugging Face course for a high-level view of how to create a dataset: The :hugs: Datasets library - Hugging Face Course.

  2. Read Sharing your dataset.

  3. Read Writing a dataset loading script and see the linked template. If you’ve looked at the librispeech_asr.py file in the librispeech dataset repository, this template will look familiar; the nice database-like interface you saw is generated by that file. A minimal sketch of such a script is below.
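
For what it’s worth, here is a minimal sketch of what such a loading script might look like. Treat everything specific in it as an assumption: the archive URL, the metadata.csv file, and its column names are placeholders, not the actual librispeech layout.

```python
# my_audio_dataset.py -- a hypothetical loading script for audio + transcripts.
import csv
import os

import datasets

# Placeholder URL; in practice this points at your hosted archive (Hub or S3).
_DL_URL = "https://example.com/my_audio_dataset/train.tar.gz"


class MyAudioDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        # Declares the schema: one audio column and one transcription column.
        return datasets.DatasetInfo(
            description="Audio recordings paired with transcriptions.",
            features=datasets.Features(
                {
                    "audio": datasets.Audio(sampling_rate=16_000),
                    "transcription": datasets.Value("string"),
                }
            ),
        )

    def _split_generators(self, dl_manager):
        # dl_manager downloads the archive and extracts it into the local cache.
        data_dir = dl_manager.download_and_extract(_DL_URL)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"data_dir": data_dir},
            )
        ]

    def _generate_examples(self, data_dir):
        # One "database entry" per audio file, keyed by its file name.
        with open(os.path.join(data_dir, "metadata.csv"), encoding="utf-8") as f:
            for row in csv.DictReader(f):
                yield row["file_name"], {
                    "audio": os.path.join(data_dir, row["file_name"]),
                    "transcription": row["transcription"],
                }
```

Once a script like this sits in a dataset repo on the Hub, `load_dataset` runs it and yields the per-file rows you saw in the librispeech viewer.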

Regarding your files, they can be uploaded directly to the HF Hub using git-lfs, or hosted on third-party storage such as S3. Like librispeech, you can use the dl_manager argument of the _split_generators method to download the content; dl_manager also takes care of unzipping.
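
For example, if you want the archives to live on the Hub itself rather than on S3, something like this should work (the repo id and file names are placeholders):

```python
# Create a dataset repo and upload an archive with huggingface_hub.
# Requires `huggingface-cli login` (or an HF token) beforehand.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-username/my-audio-dataset", repo_type="dataset", exist_ok=True)
api.upload_file(
    path_or_fileobj="train.tar.gz",    # local archive containing the audio
    path_in_repo="data/train.tar.gz",  # where it ends up in the repo
    repo_id="your-username/my-audio-dataset",
    repo_type="dataset",
)
```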

Good luck! :muscle:

that walkthrough using github issues is honestly pretty frustrating. it spends 95% of the article talking about the specific use case of github issues, and very little time talking about the main subject, which is supposed to be how to create and save a huggingface dataset. there isn’t even a line of code for saving it – arguably the most important part!
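
(for the record, the saving step the chapter glosses over is basically one call either way; the paths and repo id below are placeholders:)

```python
# saving a datasets.Dataset, either locally or to the Hub
from datasets import Dataset

ds = Dataset.from_dict({"text": ["example row"]})  # stand-in for the real data

ds.save_to_disk("my_dataset")                 # save locally in Arrow format
# ds.push_to_hub("your-username/my-dataset")  # or publish to the Hub (needs login)
```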