Can Data Files be generated upon dataset load?

nightlock · March 2, 2022, 1:48pm

I am trying to create a public dataset of audio recordings and their respective transcriptions, but I am not interested in hosting the data myself, as it is expensive: the data I would need to host is really large. (Here, data consists of two things: the audio files themselves, and the data files, which are one or more storage files (json, csv, etc.) that record the metadata/features for each audio clip.) Luckily, in my case the audio files are all already hosted elsewhere, so I do not need to host them, and generating a record (data files are made up of records) for an audio clip is a relatively simple task. It’s not really important why, but to add some context: the data files will be easy to generate because the audio corpus is essentially a collection of 50+ speakers reading the same book. So… 50 speakers, each with about 50 hours of speech. But the book is just a file of a couple hundred kilobytes, and each audio clip is named in such a way that it is clear which “portion” of the book the clip is reading. So including the transcription is an extremely easy problem. All of this is somewhat irrelevant - just adding context.

Now, all of the tutorials and examples of other loading scripts seem to suggest that the compressed file to be downloaded must contain both the data files and the audio files. The goal of _split_generators seems to be to download this compressed file and tell the dataset class where the audio files and data files live. My question is: is it possible to have the compressed file contain just the audio, and have the data files generated once the audio files finish downloading? Beyond just “possible”, because anything is possible I guess, is it “okay”? Does it break convention (in a negative way), or is it otherwise debilitating?

lhoestq · March 4, 2022, 12:53pm

Hi! Yes, the audio files and data files can come from different places: you can download the audio files and then the data files, in whatever order you want. And no, it doesn’t break any convention. The dataset scripts are designed to be flexible.

Here are a few examples:

if your audio files and data files are in the same TAR archive: common_voice.py · common_voice at main
if your audio files and data files are in the same ZIP archive: vctk.py · vctk at main
if your audio files are in TAR archives and your data files are elsewhere: multilingual_librispeech.py · facebook/multilingual_librispeech at main

nightlock · March 4, 2022, 2:31pm

Hello lhoestq,

Thank you so much for your answer, this is awesome. I just have a couple of follow-ups:

Given that audio files and data files can live separately, does this mean that I can generate the data files locally, then include that in the “path?” Looking at the third link you attached, lines 112-115:

train_kwargs = {
    "transcript_path": download_transcript(split="train"),
    "audio_archives": download_audio(split="train")
}

It looks like HF wants to download them. Could I just include them as local paths after they are generated?

Am I correct in assuming that, for audio files, the audio field (that is a datasets.Audio() type) need not be included in the data file, and it is auto generated by HF? All the datasets I have looked at don’t actually have an audio “field” in their data files, though they do define it explicitly in the features definition.

Thank you so much!

lhoestq · March 4, 2022, 4:30pm

You can download the csv/json files and then generate the dict records in Python by reading your csv/json files on-the-fly in _generate_examples, without having to write new files on disk.
You have to provide the path to the audio file in the “audio” field when you yield an example in _generate_examples.
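
To make the two points above concrete, here is a minimal sketch of what the body of _generate_examples could look like when records are built on the fly instead of being read from a stored data file. It is plain Python so that it runs on its own; in a real loading script the same logic would live in the builder's _generate_examples method. The clip naming scheme (speaker01_0000.wav, where the trailing number indexes a line of the book), the helper parse_clip_name, and the field names speaker_id and text are all assumptions made for illustration, since the thread does not give the actual file layout.

```python
import os
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_clip_name(audio_path: str) -> Tuple[str, int]:
    """Split a HYPOTHETICAL clip name like 'speaker01_0000.wav' into
    (speaker_id, line_index). The real naming scheme is not stated in the thread."""
    stem = os.path.splitext(os.path.basename(audio_path))[0]
    speaker_id, line_index = stem.rsplit("_", 1)
    return speaker_id, int(line_index)


def generate_examples(
    audio_paths: Iterable[str], book_lines: List[str]
) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (key, example) pairs, as a loading script's _generate_examples would.

    No metadata file is written to disk: each record is derived from the
    clip's file name and the book text at iteration time.
    """
    for audio_path in audio_paths:
        speaker_id, line_index = parse_clip_name(audio_path)
        yield audio_path, {
            # The path goes in the "audio" field, as described above.
            "audio": audio_path,
            "speaker_id": speaker_id,  # assumed field name
            "text": book_lines[line_index],  # assumed field name
        }


# Toy inputs standing in for the downloaded audio files and the book.
book_lines = ["It was a dark night.", "The wind howled."]
audio_paths = ["clips/speaker01_0000.wav", "clips/speaker02_0001.wav"]
examples = list(generate_examples(audio_paths, book_lines))
```

The key of each yielded pair only needs to be unique per example; the audio path is used here because it is already unique in this toy setup.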
