How does one actually create a new dataset?

I am relatively new to the ML scene, at least as far as creating datasets and training models goes. I have thousands of hours of audio that is perfectly transcribed, and I want to make a Hugging Face dataset from it. I have followed the official tutorials describing how to create a loading script (dataset script) and how to make a dataset card, but they seem to address questions that come later in the process.

Currently, I have close to 250k audio files stored locally, along with their respective transcriptions. How do I get from where I am now to a public dataset that people can use to train their models? Specifically:

  1. Obviously, my audio files and my transcripts have to be available publicly somewhere (an S3 bucket, for example). Do I store all the audio files unpacked and uncompressed, or am I supposed to zip them up? If the latter, how does the dataset card access files individually? Looking at a popular HF dataset, librispeech, the dataset card seems to be organized like a database, with access to each audio file and its respective transcript. How did they do that, when their DL_URLS seem to point to zipped archives?

  2. The librispeech dataset card, as I mentioned before, looks a lot like a database to me. How do I create entries in this “database”? How do I actually make an entry for each audio file, with its respective transcription and other relevant metadata?

  3. Once I create this “database”, how do I actually deploy/publish it? Looking at librispeech’s git repo, I don’t see where this is done, and I can’t find any documentation on it. It makes sense that this step wouldn’t live in the git repo, but I’m just a bit lost.

I hope my questions reveal the “theme” and spirit of my overall question: how do I actually create and publish a complete (audio) dataset?

Of course, questions are encouraged if you don’t understand what I’m trying to do.


Hello! :slight_smile:

Great question! It sent me off exploring how to do this myself, so this advice is not official, but I hope it’s helpful:

  1. Go through Chapter 5 of the Hugging Face course for a high-level view of how to create a dataset: The :hugs: Datasets library - Hugging Face Course.

  2. Read Sharing your dataset.

  3. Read Writing a dataset loading script and see the linked template. If you’ve looked at the librispeech_asr.py file in the librispeech dataset repository, this template will look familiar. The nice database-like interface you saw is generated by that script: each example it yields becomes one “row”. There’s a minimal sketch of such a script right after this list.
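
To make that concrete, here is a minimal, unofficial sketch of what such a loading script can look like for a corpus of paired audio/transcript files. The URL, class name, field names, and file layout (each `.wav` next to a same-named `.txt` transcript) are all placeholder assumptions for illustration, not what librispeech actually uses:

```python
import os

import datasets

# Placeholder URL -- point this at wherever you host your archive (S3, HF Hub, ...).
_DL_URL = "https://my-bucket.s3.amazonaws.com/my_corpus/train.tar.gz"


class MyAudioDataset(datasets.GeneratorBasedBuilder):
    """Loading script for .wav files with same-named .txt transcripts."""

    def _info(self):
        # The features act as the "database schema": one column per field.
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "id": datasets.Value("string"),
                    "audio": datasets.Audio(sampling_rate=16_000),
                    "transcription": datasets.Value("string"),
                }
            )
        )

    def _split_generators(self, dl_manager):
        # dl_manager downloads the archive and unpacks it to a local directory,
        # so the files can stay zipped/tarred wherever they are hosted.
        data_dir = dl_manager.download_and_extract(_DL_URL)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"data_dir": data_dir},
            )
        ]

    def _generate_examples(self, data_dir):
        # Each yielded dict becomes one "row" of the dataset.
        for fname in sorted(os.listdir(data_dir)):
            if not fname.endswith(".wav"):
                continue
            stem = fname[: -len(".wav")]
            with open(os.path.join(data_dir, stem + ".txt"), encoding="utf-8") as f:
                text = f.read().strip()
            yield stem, {
                "id": stem,
                "audio": os.path.join(data_dir, fname),
                "transcription": text,
            }
```

You can test a script like this locally with `load_dataset("path/to/my_script.py")` before publishing anything.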

Regarding your files, they can be uploaded directly to the HF Hub using git-lfs, or hosted on third-party storage such as S3. Like librispeech, you can use the dl_manager argument of the _split_generators method to download the content; dl_manager also takes care of the unzipping.
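
As an alternative to writing a loading script at all, recent versions of the datasets library can build the dataset in memory and push it straight to the Hub, handling the git-lfs upload for you. A minimal sketch, assuming the same paired `.wav`/`.txt` layout as above; the directory and repo id are placeholders:

```python
import os

from datasets import Audio, Dataset

data_dir = "path/to/local/files"  # placeholder: your local corpus

ids, paths, texts = [], [], []
for fname in sorted(os.listdir(data_dir)):
    if fname.endswith(".wav"):
        stem = fname[: -len(".wav")]
        ids.append(stem)
        paths.append(os.path.join(data_dir, fname))
        with open(os.path.join(data_dir, stem + ".txt"), encoding="utf-8") as f:
            texts.append(f.read().strip())

ds = Dataset.from_dict({"id": ids, "audio": paths, "transcription": texts})
# Casting to the Audio feature makes each path load as actual audio data.
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

# Uploads via git-lfs under the hood; run `huggingface-cli login` first.
ds.push_to_hub("your-username/your-dataset")  # placeholder repo id
```

push_to_hub creates the dataset repo under your account if it doesn’t exist yet, which also answers your question 3: publishing is just pushing to a Hub repo.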

Good luck! :muscle:
