Dataset loading script for an audio dataset

I want to fix all metadata and loading for my audio datasets (like comodoro/vystadial2016_asr), but cannot find specific documentation for audio datasets, in particular how to load into an Audio column.

I looked at librispeech_asr/blob/main/librispeech_asr.py, but the audio data there is only referenced in the feature metadata and in _generate_examples. I tried to emulate it anyway, but got stuck at these lines with [Errno 2] No such file or directory: 'data_voip_cs_2016'. I do not know what exactly the download manager returns or how to access it, even after looking at the source.

Several questions:

  • Is there any more documentation?
  • Can I debug the script locally instead of adding a commit for every fix attempt?
  • I would also like to add a loading script for some JSON (HF-hosted) audio datasets like datasets/comodoro/pscr. Is that even possible?

Ad 2: datasets-cli, I suppose.

Maybe @polinaeterna ?

I have made some progress (the DownloadManager returns a different type for multiple and single downloads; not strictly speaking an error, but confusing). Now I have been getting "The split is being processed. Retry later." for two hours. Something is still wrong.
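
For anyone hitting the same confusion: the download manager's return value mirrors the shape of what you pass in, so a single URL gives a single path, a list gives a list, and a dict gives a dict. Below is a minimal sketch of a helper that normalizes all three shapes into a list. `as_path_list` is a hypothetical name, not part of `datasets`, and it assumes the values are plain path strings.

```python
from typing import Dict, List, Union

# The shapes a download call is assumed to return in this sketch.
Downloaded = Union[str, List[str], Dict[str, str]]


def as_path_list(downloaded: Downloaded) -> List[str]:
    """Flatten a download result into a list of local paths.

    Hypothetical helper: assumes the result is a str, a list of str,
    or a dict mapping names to str.
    """
    if isinstance(downloaded, str):
        # Single URL in, single path out.
        return [downloaded]
    if isinstance(downloaded, dict):
        # Dict in, dict out: keep the paths, drop the names.
        return list(downloaded.values())
    # List (or tuple) in, same shape out.
    return list(downloaded)
```

With this, the rest of the script can always iterate over paths, whether it downloaded one archive or several.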

Hey @comodoro, there’s a new audiofolder feature that lets you load audio datasets with metadata: Load audio data

You’ll need to install datasets from main to make use of it :slight_smile:
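
A rough sketch of how that could look, assuming the audio files sit in one folder together with a `metadata.csv` whose `file_name` column points at each file. The `transcription` column name and both function names are arbitrary choices for this example, not something the thread specifies.

```python
import csv
from pathlib import Path


def write_metadata(data_dir, transcripts):
    """Write metadata.csv next to the audio files.

    `transcripts` maps a file name (relative to data_dir) to its text.
    The `file_name` column links each row to an audio file; the
    `transcription` column name is an arbitrary choice for this sketch.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    metadata_path = data_dir / "metadata.csv"
    with metadata_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name", "transcription"])
        for file_name, text in sorted(transcripts.items()):
            writer.writerow([file_name, text])
    return metadata_path


def load_audio_folder(data_dir):
    """Load the folder with the audiofolder builder (needs `datasets`)."""
    from datasets import load_dataset  # imported lazily; requires the library

    return load_dataset("audiofolder", data_dir=str(data_dir))
```

Calling `write_metadata("my_audio", {"a.wav": "ahoj"})` and then `load_audio_folder("my_audio")` would be the intended usage, with no loading script needed.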

Thanks, I will try it with the next ones. In the meantime, success! The last few errors were caused by me not following the dataset structure, which is different from Librispeech. Frankly, if not for my strong determination, I would have given up a long time ago. Lessons learned:

  • ALWAYS ensure datasets-cli runs OK before making a commit
  • _generate_examples indeed generates all samples; it is just badly named.
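
To illustrate the second point: in a loading script, `_generate_examples` is a generator that yields one `(key, example)` pair per sample of a split. Here is a standalone sketch of that pattern, assuming a hypothetical layout where every `name.wav` has a transcript in `name.txt` beside it (not necessarily the layout of my datasets). The `audio` and `text` field names are assumptions too; an `Audio` column accepts a file path as its value.

```python
import os


def generate_examples(data_dir):
    """Yield (key, example) pairs for every .wav file under data_dir.

    Assumed layout (hypothetical): each `name.wav` has its transcript
    in `name.txt` in the same directory.
    """
    for root, dirs, files in os.walk(data_dir):
        dirs.sort()  # make the traversal order deterministic
        for name in sorted(files):
            if not name.endswith(".wav"):
                continue
            audio_path = os.path.join(root, name)
            transcript_path = os.path.splitext(audio_path)[0] + ".txt"
            with open(transcript_path, encoding="utf-8") as f:
                text = f.read().strip()
            # The path relative to data_dir is unique, so it works as a key.
            key = os.path.relpath(audio_path, data_dir)
            yield key, {"audio": audio_path, "text": text}
```

Because it is a plain generator, it can be run against a locally extracted copy of the data to check paths and transcripts before committing anything.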