Dataset loading script for an audio dataset

comodoro · August 31, 2022, 6:14am

I want to fix all metadata and loading for my audio datasets (like comodoro/vystadial2016_asr), but cannot find specific documentation for audio datasets, in particular how to load into an Audio column.

I looked at librispeech_asr/blob/main/librispeech_asr.py, but audio data there is only referenced in the feature metadata and in _generate_examples. I tried to emulate it anyway, but got stuck at these lines with [Errno 2] No such file or directory: 'data_voip_cs_2016'. I do not know what exactly the download manager returns or how to access it, even after looking at the source.

Several questions:

1. Is there any more documentation?
2. Can I debug the script locally instead of adding a commit for every fix attempt?
3. I would also like to add a loading script for some JSON (HF-hosted) audio datasets like datasets/comodoro/pscr. Is that even possible?

comodoro · September 1, 2022, 6:21am

Ad 2: datasets-cli, I suppose.

lhoestq · September 1, 2022, 1:40pm

Maybe @polinaeterna ?

comodoro · September 1, 2022, 2:05pm

I have made some progress (DownloadManager returns a different type for multiple and single downloads; not strictly speaking an error, but confusing). Now I have been getting "The split is being processed. Retry later." for two hours. Something is still wrong.
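
The type difference mentioned here can be sketched as follows. As far as I know, the download manager in datasets mirrors the shape of its argument: a single URL string gives back a single local path, while a list or dict of URLs gives back a list or dict of paths. The helper name normalize_downloaded is hypothetical and not part of the library; it only shows one way a script could treat all three shapes the same.

```python
from typing import Dict, List, Union

# Shapes a download call can hand back, depending on the shape of its input.
Downloaded = Union[str, List[str], Dict[str, str]]


def normalize_downloaded(result: Downloaded) -> List[str]:
    """Return local paths as a flat list (hypothetical helper, not a datasets API)."""
    if isinstance(result, str):
        # Single URL in, single path out.
        return [result]
    if isinstance(result, dict):
        # Dict of URLs in, dict of paths out; keep only the paths.
        return list(result.values())
    # List (or tuple) of URLs in, list of paths out.
    return list(result)
```

Treating a returned list as if it were a single path string is one plausible way to end up with a "No such file or directory" error, so checking the type first is cheap insurance.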

lewtun · September 2, 2022, 9:37am

Hey @comodoro, there's a new audiofolder feature that lets you load audio datasets with metadata: Load audio data

You’ll need to install datasets from main to make use of it
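
For reference, a minimal sketch of what audiofolder expects, assuming the documented layout: a directory of audio files plus a metadata.csv whose file_name column points at each file, with any other columns becoming features. The helper names and the transcription column are illustrative choices, not something the thread specifies.

```python
import csv
from pathlib import Path
from typing import Dict


def write_audiofolder_metadata(data_dir: str, transcripts: Dict[str, str]) -> Path:
    """Write metadata.csv mapping audio file names (relative to data_dir) to text."""
    metadata_path = Path(data_dir) / "metadata.csv"
    with metadata_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # audiofolder looks for a "file_name" column; "transcription" is an assumed extra column.
        writer.writerow(["file_name", "transcription"])
        for file_name, text in sorted(transcripts.items()):
            writer.writerow([file_name, text])
    return metadata_path


def load_audiofolder(data_dir: str):
    """Load the folder; needs a datasets version that ships audiofolder."""
    # Imported lazily so the metadata helper above works without datasets installed.
    from datasets import load_dataset

    return load_dataset("audiofolder", data_dir=data_dir)
```

With the metadata file in place, load_audiofolder("path/to/folder") should return a dataset with an audio column and a transcription column, with no loading script needed.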

comodoro · September 2, 2022, 10:23am

Thanks, I will try it with the next ones. In the meantime, success! The last few errors were caused by me not following the dataset structure, which is different from Librispeech. Frankly, if not for my strong determination, I would have given up a long time ago. Lessons learned:

ALWAYS ensure datasets-cli runs OK before making a commit
_generate_examples does indeed generate all samples; it is just badly named.
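
To make the second lesson concrete: in a loading script, _generate_examples is a generator that yields one (key, example) pair per sample in the split. Below is a standalone sketch of that pattern, written outside the builder class so it can be run locally before committing. It assumes a hypothetical tab-separated transcripts.tsv with one "relative audio path, tab, text" line per sample, which is not the actual layout of any dataset in this thread; the field names audio and text are also only illustrative.

```python
import os
from typing import Dict, Iterator, Tuple


def generate_examples(
    data_dir: str, transcript_name: str = "transcripts.tsv"
) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (key, example) pairs for every sample listed in the transcript file."""
    transcript_path = os.path.join(data_dir, transcript_name)
    with open(transcript_path, encoding="utf-8") as f:
        for key, line in enumerate(f):
            line = line.rstrip("\n")
            if not line:
                # Skip blank lines; the line number still serves as a unique key.
                continue
            rel_path, text = line.split("\t", 1)
            # The audio field carries a path here; decoding is left to the library.
            yield key, {"audio": os.path.join(data_dir, rel_path), "text": text}
```

Calling list(generate_examples("some/local/dir")) on an extracted copy of the data is a quick way to catch path mistakes without pushing a commit for each attempt.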
