Hey there!
I am trying to create a custom audio dataset from local files: I have the audio files (mp3) and the corresponding metadata (json). I can't upload the data to the Hugging Face Hub because it's confidential. Does anyone know if there is a way of creating the audio dataset without uploading the data to the Hub?
Hi! You can use the AudioFolder builder to quickly create the dataset; just make sure you have the metadata file in the same directory (check out the docs here for more details!):
```python
from datasets import load_dataset

dataset = load_dataset("audiofolder", data_dir="/path/to/data")
```
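AudioFolder looks for a `metadata.jsonl` (or `metadata.csv`) file next to the audio files, and it must contain a `file_name` column pointing at each file. A sketch of the expected layout, with hypothetical file and column names:

```
/path/to/data/
├── metadata.jsonl
├── clip_0.mp3
└── clip_1.mp3
```

Each line of `metadata.jsonl` is one JSON object:

```json
{"file_name": "clip_0.mp3", "label": "positive"}
{"file_name": "clip_1.mp3", "label": "negative"}
```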
I did some checks yesterday and it sort of worked. I am currently at the training step, and training fails with a CUDA out-of-memory error whose message mentions PYTORCH_CUDA_ALLOC_CONF (which I am trying to resolve).
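What I am experimenting with so far, based on the error message (just a sketch; `max_split_size_mb` is one allocator knob among several, and lowering the batch size may be the simpler fix):

```python
import os

# Must be set before CUDA is initialized, so set it before the first torch CUDA call.
# max_split_size_mb limits allocator block splitting, which can reduce fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the variable
```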
Here is what I did:
Arranged my data in a local folder, with files named audio_0.wav, audio_1.wav, …, audio_N.wav.
Built the dataset from a dict; df_train was a pandas DataFrame in my case, with information like the label for each audio file:
```python
from datasets import Dataset, Audio

audio_dict = {}
audio_dict["audio"] = [f"path/to/audio_{i}.wav" for i in range(len(df_train))]  # audio_0.wav ... audio_N.wav
audio_dict["label"] = [x for x in df_train.label]
audio_dict["split"] = len(df_train) * ["train"]

# cast the path column to the Audio feature so each example decodes on access
audio_dataset = Dataset.from_dict(audio_dict).cast_column("audio", Audio())
# note: for now I am not using the split column anywhere in my code, it can be omitted
```
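For a quick sanity check: indexing into the dataset decodes the audio through the Audio feature, so (assuming the paths above actually exist) you can do something like:

```python
sample = audio_dataset[0]
print(sample["label"])
print(sample["audio"]["sampling_rate"])  # e.g. 44100
print(sample["audio"]["array"].shape)    # 1-D numpy array of samples
```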
Once I got my audio_dataset object, I replaced the "dataset" object with "audio_dataset" in this notebook: Google Colab
However, I did solve my initial issue: I had saved the metadata in a .json file, which is wrong. You have to save it as a .jsonl file (named metadata.jsonl, one JSON object per line). After I changed the file type, audiofolder worked for me.
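In case it saves someone a step, here is a minimal sketch of that conversion, assuming the original metadata.json held a list of records that each include a "file_name" key:

```python
import json

with open("metadata.json") as f:
    records = json.load(f)  # assumed: a list of dicts, each with a "file_name" key

with open("metadata.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")  # one JSON object per line
```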