How does one actually create a new dataset?

beneyal · February 21, 2022, 5:38pm

Hello!

Great question! It sent me off to find out how to do this myself, so this advice is not official, but I hope it’s helpful:

1. Go through Chapter 5 of the Hugging Face course for a high-level view of how to create a dataset: "The Datasets library - Hugging Face Course".
2. Read "Sharing your dataset".
3. Read "Writing a dataset loading script" and see the linked template. If you’ve seen the librispeech_asr.py file in the librispeech dataset repository, this template will look familiar. The nice database-like interface you saw is based on this file.
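
The core of such a loading script is a generator method, _generate_examples, that yields (key, example) pairs. Here is a minimal sketch of that logic in plain Python. It assumes a hypothetical layout where every .wav file has its transcript in a .txt file of the same name; the layout, the function name and the field names are my own assumptions, not taken from librispeech_asr.py:

```python
import os


def generate_examples(data_dir):
    """Yield (key, example) pairs, the shape `_generate_examples` produces.

    Assumed (hypothetical) layout: <data_dir>/**/<id>.wav with a sibling <id>.txt.
    """
    for root, dirs, files in os.walk(data_dir):
        dirs.sort()  # make the traversal order deterministic
        for name in sorted(files):
            if not name.endswith(".wav"):
                continue
            audio_path = os.path.join(root, name)
            text_path = os.path.splitext(audio_path)[0] + ".txt"
            if not os.path.exists(text_path):
                continue  # skip audio that has no transcript
            with open(text_path, encoding="utf-8") as f:
                text = f.read().strip()
            # Keys should be unique per example; the relative path is used here.
            key = os.path.relpath(audio_path, data_dir)
            yield key, {"file": audio_path, "text": text}
```

In a real script this would be a method on the builder class from the template, and the fields it yields would have to match the features declared there.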

As for your files, you can upload them directly to the HF Hub with git-lfs, or host them on third-party storage such as S3. As librispeech does, you can use the dl_manager parameter of the _split_generators method to download the content; dl_manager also takes care of unzipping.
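
To make the download step a bit more concrete, here is a rough sketch of that flow in plain Python, so it runs without the library installed. download_and_extract is the real method name on the dl_manager object the library passes in; LocalZipManager, split_generators and the train/test layout are hypothetical stand-ins of mine. In a real script this logic lives in _split_generators and returns datasets.SplitGenerator objects rather than a dict:

```python
import os
import tempfile
import zipfile


class LocalZipManager:
    """Hypothetical stand-in for the library's download manager (local zips only)."""

    def download_and_extract(self, path_or_url):
        # The real manager also downloads and caches; this one only unzips.
        out_dir = tempfile.mkdtemp()
        with zipfile.ZipFile(path_or_url) as zf:
            zf.extractall(out_dir)
        return out_dir


def split_generators(dl_manager, archive):
    """Mirror the shape of `_split_generators`: fetch data, then map splits to paths."""
    data_dir = dl_manager.download_and_extract(archive)
    # Assumed layout inside the archive: one sub-directory per split.
    return {
        split: os.path.join(data_dir, split)
        for split in ("train", "test")
        if os.path.isdir(os.path.join(data_dir, split))
    }
```

The point of the pattern is that the script never handles downloading or unzipping itself: it asks dl_manager for a local directory and then only deals with paths inside it.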

Good luck!
