Can Data Files be generated upon dataset load?

I am trying to create a public dataset of audio recordings and their respective transcriptions, but I am not interested in hosting the data myself because it is expensive: the amount of data I would need to host is really large. (Here, the data consists of two things: the audio files themselves, and the data files, i.e. one or more storage files (json, csv, etc.) that record the metadata/features for each audio clip.) Luckily, in my case the audio files are all already hosted elsewhere, so I do not need to host them, and generating a record (data files are made up of records) for an audio clip is a relatively simple task.

For context (though it is not strictly relevant): the data files will be easy to generate because the audio corpus is essentially a collection of 50+ speakers reading the same book, each with about 50 hours of speech. The book itself is just a couple-hundred-kilobyte file, and each audio clip is named in such a way that it is clear which "portion" of the book it is reading, so producing the transcription is an extremely easy problem.

Now, all of the tutorials and examples of other loading scripts seem to suggest that the compressed file to be downloaded must contain both the data files and the audio files, and that the goal of _split_generators is to download this compressed file and tell the dataset class where the audio files and data files live. My question is: is it possible to have the compressed file contain just the audio, and to generate the data files once the audio files finish downloading? And beyond merely "possible" (anything is possible, I guess), is it okay? Does it break convention in a negative, debilitating way?

Hi ! Yes, the audio files and data files can come from different places: you can download the audio files and then the data files, in whatever order you want. And no, it doesn't break any convention :slight_smile: Dataset scripts are designed to be flexible.
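For instance, a loading script where the audio archives and the transcript come from different places could look roughly like this (the URLs, names and features below are placeholders, just to give an idea):

import datasets

# Placeholder URLs: the audio is hosted in one place, the metadata/transcript in another
_AUDIO_ARCHIVE_URL = "https://example.com/audio/train.tar.gz"
_TRANSCRIPT_URL = "https://example.com/metadata/train.csv"


class MyAudioDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "id": datasets.Value("string"),
                    "audio": datasets.Audio(sampling_rate=16_000),
                    "transcription": datasets.Value("string"),
                }
            )
        )

    def _split_generators(self, dl_manager):
        # The two downloads are independent and can happen in any order
        audio_dir = dl_manager.download_and_extract(_AUDIO_ARCHIVE_URL)
        transcript_path = dl_manager.download(_TRANSCRIPT_URL)
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"audio_dir": audio_dir, "transcript_path": transcript_path},
            )
        ]

    def _generate_examples(self, audio_dir, transcript_path):
        ...  # read the transcript and yield one record per audio clip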

Here are a few examples:

Hello lhoestq,

Thank you so much for your answer, this is awesome. I just have a couple of follow-ups:

  1. Given that audio files and data files can live separately, does this mean that I can generate the data files locally and then pass them in as the "path"? Looking at the third link you attached, lines 112-115:
train_kwargs = {
    "transcript_path": download_transcript(split="train"),
    "audio_archives": download_audio(split="train")
}

It looks like HF expects to download them. Could I just pass them in as local paths after they are generated (roughly as in the sketch after this list)?

  2. Am I correct in assuming that, for audio files, the audio field (of type datasets.Audio()) need not be included in the data file, and that it is auto-generated by HF? None of the datasets I have looked at actually have an audio "field" in their data files, though they do define it explicitly in the features definition.
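For reference, what I have in mind for (1) is roughly this (the local csv path here is just a hypothetical example):

# inside the GeneratorBasedBuilder subclass
def _split_generators(self, dl_manager):
    # The audio is still downloaded from its existing host,
    # but the transcript csv is a file I generated locally beforehand
    train_kwargs = {
        "transcript_path": "data/train_transcripts.csv",  # local, pre-generated file
        "audio_archives": dl_manager.download_and_extract("https://example.com/train_audio.tar.gz"),
    }
    return [
        datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs=train_kwargs)
    ]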

Thank you so much!

  1. You can download the csv/json files and then generate the dict records in Python by reading your csv/json files on the fly in _generate_examples, without having to write new files to disk.
  2. You have to provide the path to the audio file in the "audio" field when you yield an example in _generate_examples (both points are sketched below).
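Something along these lines, as a rough sketch (column names and file layout are just an example):

import csv
import os

# inside your GeneratorBasedBuilder subclass
def _generate_examples(self, audio_dir, transcript_path):
    # Build the dict records by reading the transcript csv on the fly,
    # without writing any new data files to disk
    with open(transcript_path, encoding="utf-8") as f:
        for row in csv.DictReader(f):
            audio_id = row["id"]  # example column names
            yield audio_id, {
                "id": audio_id,
                # yield the path in the "audio" field: since the column is declared
                # as datasets.Audio() in the features, the library decodes the file for you
                "audio": os.path.join(audio_dir, audio_id + ".wav"),
                "transcription": row["text"],
            }

The "audio" column still has to be declared as datasets.Audio() in the features definition; in _generate_examples you only yield the path to the file.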