How does one actually create a new dataset?

Hello! :slight_smile:

Great question! It sent me off to figure out how to do this myself, so this advice is not official, but I hope it’s helpful:

  1. Go through Chapter 5 of the Hugging Face course for a high-level view of how to create a dataset: The :hugs: Datasets library - Hugging Face Course.

  2. Read Sharing your dataset.

  3. Read Writing a dataset loading script and see the linked template. If you’ve seen the librispeech_asr.py file in the librispeech dataset repository, this template will look familiar. The nice database-like interface you saw is based on this file.

Regarding your files, they can be uploaded directly to the HF Hub using git-lfs, or hosted on third-party storage such as S3. As in librispeech, you can use the dl_manager argument of the _split_generators method to download the content. dl_manager also takes care of the unzipping.

Good luck! :muscle:
