Hello!
Great question! It sent me off to find out how to do this myself, so this advice is not official, but I hope it’s helpful:
- Go through Chapter 5 of the Hugging Face course for a high-level view of how to create a dataset: The Datasets library - Hugging Face Course.
- Read Sharing your dataset.
- Read Writing a dataset loading script and see the linked template. If you’ve seen the librispeech_asr.py file in the librispeech dataset repository, this template will look familiar. The nice database-like interface you saw is based on this file.
Regarding your files, they can be uploaded directly to the HF Hub using git-lfs, or hosted on third-party storage such as S3. As in librispeech, you can use the dl_manager parameter of the _split_generators method to download the content. dl_manager also takes care of the unzipping.
Good luck!