I am relatively new to the ML scene, at least as far as creating datasets and training models goes. I have thousands of hours of audio that is perfectly transcribed, and I want to turn it into a Hugging Face dataset. I have followed the official tutorials on writing a loading script (dataset script) and a dataset card, but they seem to address questions I would only have later in the process.
Currently, I have close to 250k audio files stored locally, along with their respective transcriptions. How do I get from where I am now to a public dataset that people can use to train their models? Specifically:
- Obviously, my audio files and my transcripts have to be available publicly somewhere (an S3 bucket, for example). Do I store all the audio files unpacked, unzipped, and uncompressed, or am I supposed to zip them up? If I zip them, how does the loading script access files individually? Looking at a popular HF dataset, LibriSpeech, its loading script seems to be organized like a database, with access to each audio file and its respective transcript. How did they do that, when their DL_URLS seem to point to zipped-up files? (I've sketched my current mental model of this below the list.)
- The LibriSpeech loading script, as I mentioned, looks a lot like a database to me. How do I create entries in this “database”? How do I actually make an entry for each audio file with its respective transcription and other relevant metadata? (Also sketched below.)
- Once I have created this “database”, how do I actually deploy/publish it? Looking at LibriSpeech's repo on the Hub, I don't see where this is done, and I can't find any documentation on it. It makes sense that it wouldn't be in the repo, but I'm just a bit lost. (My best guess at this step is also sketched below.)
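To make the first question more concrete, here is my current mental model of how a LibriSpeech-style loading script gets from zipped archives to individual files, boiled down to a minimal sketch. The bucket URL, file layout, field names, and class name are all made up by me, and I'm not sure this model is even right:

```python
import os

import datasets

# Made-up archive URL -- just my guess at what something like DL_URLS points at.
_DL_URLS = {"train": "https://my-bucket.s3.amazonaws.com/train.tar.gz"}


class MyAsrDataset(datasets.GeneratorBasedBuilder):
    """Sketch of a loading script for paired audio + transcript files."""

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "id": datasets.Value("string"),
                    "audio": datasets.Audio(sampling_rate=16_000),
                    "text": datasets.Value("string"),
                }
            )
        )

    def _split_generators(self, dl_manager):
        # The zipped/tarred archive gets downloaded and extracted here, which
        # (I think) is how the script can still reach individual files inside it.
        data_dir = dl_manager.download_and_extract(_DL_URLS["train"])
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"data_dir": data_dir},
            )
        ]

    def _generate_examples(self, data_dir):
        # One yielded dict per audio file -- my guess at what an "entry" is.
        # Assumes each .wav sits next to a .txt transcript with the same stem.
        for fname in sorted(os.listdir(data_dir)):
            if not fname.endswith(".wav"):
                continue
            stem = fname[: -len(".wav")]
            with open(os.path.join(data_dir, stem + ".txt"), encoding="utf-8") as f:
                text = f.read().strip()
            yield stem, {
                "id": stem,
                "audio": os.path.join(data_dir, fname),
                "text": text,
            }
```

Is that roughly what the LibriSpeech script is doing under the hood, or am I misreading it?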
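For the second question, my working guess is that an “entry” is just one row holding an audio path plus its transcription and whatever metadata I care about, roughly like this (column names and paths are hypothetical):

```python
from datasets import Audio, Dataset

# Hypothetical rows -- in reality these lists would be built from my 250k files.
rows = {
    "id": ["utt_000001", "utt_000002"],
    "audio": ["clips/utt_000001.wav", "clips/utt_000002.wav"],
    "text": ["first transcript here", "second transcript here"],
    "speaker_id": [17, 42],
}

ds = Dataset.from_dict(rows).cast_column("audio", Audio(sampling_rate=16_000))
print(ds[0])  # one "entry": the audio column decodes to path, array, sampling_rate
```

Is that the right granularity, or does an entry need to be more than a row in a table?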
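And for the publishing question, the closest thing I've found in the docs is push_to_hub, which might explain why I don't see any “deploy” step in LibriSpeech's repo. Again a sketch with a made-up repo name, assuming I've already run huggingface-cli login:

```python
from datasets import Audio, Dataset

# A hypothetical one-row dataset, then published to the Hub.
rows = {
    "audio": ["clips/utt_000001.wav"],
    "text": ["first transcript here"],
}
ds = Dataset.from_dict(rows).cast_column("audio", Audio(sampling_rate=16_000))

# Made-up repo id; as I understand it, this creates (or updates) a dataset repo
# under my account and uploads the data, audio included.
ds.push_to_hub("my-username/my-asr-dataset")
```

Is that the intended path for a dataset of this size (~250k files), or should I be uploading archives somewhere and keeping the loading-script approach from the sketch further up?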
I hope these questions convey the theme and spirit of what I'm really asking: how do I actually create and publish a complete (audio) dataset?
Of course, questions are encouraged if you don’t understand what I’m trying to do.