Audio dataset without uploading the data to the hub

Hey there!
I am trying to create a custom audio dataset from local files: I have the audio files (mp3) and the corresponding metadata (json). I can’t upload the data to the Hugging Face Hub because it’s confidential. Does anyone know if there is a way of creating the audio dataset without uploading the data to the Hub?


Hi! You can use AudioFolder to quickly create the dataset from a local directory. Just make sure the metadata file is in the same directory as the audio files (check out the docs here for more details!) :grinning:

from datasets import load_dataset
dataset = load_dataset("audiofolder", data_dir="/path/to/data")
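If the metadata columns don't show up after loading, a quick sanity check of the folder can help. Below is a minimal sketch using only the standard library. It assumes the metadata file is named metadata.jsonl and that each row has a file_name key pointing to an audio file relative to the data directory, which is my reading of the AudioFolder docs. The helper name check_audiofolder is made up for illustration, and it only covers the JSON Lines variant of the metadata file.

```python
import json
from pathlib import Path


def check_audiofolder(data_dir):
    """Return a list of problems found in an AudioFolder-style directory.

    Assumption: metadata lives in <data_dir>/metadata.jsonl and every row
    has a "file_name" key with a path relative to data_dir.
    """
    data_dir = Path(data_dir)
    metadata_path = data_dir / "metadata.jsonl"
    if not metadata_path.is_file():
        return [f"missing {metadata_path.name}"]

    problems = []
    lines = metadata_path.read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            row = json.loads(line)
        except json.JSONDecodeError:
            problems.append(f"line {line_no}: not valid JSON")
            continue
        if not isinstance(row, dict):
            problems.append(f"line {line_no}: expected a JSON object")
            continue
        file_name = row.get("file_name")
        if file_name is None:
            problems.append(f"line {line_no}: no 'file_name' key")
        elif not (data_dir / file_name).is_file():
            problems.append(f"line {line_no}: {file_name} not found")
    return problems
```

Calling check_audiofolder("/path/to/data") returns an empty list when nothing looks wrong, so you can print the result before calling load_dataset.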

Hi @vimey ,

Were you able to achieve this?
If yes, could you please explain how you did it?

Hey @stevhliu and @sriniu,

Unfortunately, this solution didn’t work for me. I had tried it before, but this is all I got as a result:

    test: Dataset({
         features: ['audio'],
         num_rows: 1

But it should be two columns (“file_name”, “transcription”) instead of “audio”.

I did some checks yesterday and it sort of worked. I am currently at the training step, where the training process fails with a PYTORCH_CUDA_ALLOC_CONF error (which I am trying to resolve).

What I did:

  1. Arrange my data in a local folder with the names audio_0.wav, audio_1.wav… , audio_N.wav.

df_train was a pandas DataFrame in my case, with information such as the label for each audio file.

from datasets import Audio, Dataset

audio_dict = {}
# Paths to the local audio files, one per row of df_train
audio_dict["audio"] = ["path/to/audio_1", "path/to/audio_2", ..., "path/to/audio_n"]
audio_dict["label"] = list(df_train.label)
audio_dict["split"] = len(df_train) * ["train"]
# Cast the path strings to the Audio feature so the files are decoded as audio
audio_dataset = Dataset.from_dict(audio_dict).cast_column("audio", Audio())
# Note: for now I am not using the split column anywhere in my code; I think it can be omitted.

Once I had my audio_dataset object, I replaced the “dataset” object with “audio_dataset” in this notebook, Google Colab
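One thing to watch with files named audio_0.wav … audio_N.wav: a plain alphabetical sort puts audio_10.wav before audio_2.wav, so the paths can drift out of line with the labels. Here is a rough sketch of collecting the paths in numeric order instead. It assumes the rows of the dataframe are in the same order as the file numbers, which may not hold for your data. The helper name numbered_audio_paths and its default pattern are made up for illustration.

```python
import re
from pathlib import Path


def numbered_audio_paths(folder, pattern="audio_*.wav"):
    """Return matching file paths as strings, sorted by the integer in the name."""

    def index_of(path):
        # Pull the trailing number out of e.g. "audio_12"
        match = re.search(r"(\d+)$", path.stem)
        # Files without a trailing number sort first
        return int(match.group(1)) if match else -1

    return [str(p) for p in sorted(Path(folder).glob(pattern), key=index_of)]
```

The returned list can then be used as the “audio” entry of the dict, after checking that it has as many entries as the dataframe has rows.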

Thanks for your approach! It looks really good :slight_smile:

However, I solved my initial issue: I had saved the metadata in a json file, which is wrong. You have to save it in a jsonl file (one JSON object per line). After I changed the file type, audiofolder worked for me.
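In case it helps someone with the same problem, here is a small sketch of that conversion using only the standard library. It assumes the original json file holds a single top-level list of objects (one per audio file). If your json is structured differently, the loop needs adapting. The function name json_to_jsonl is just for illustration.

```python
import json
from pathlib import Path


def json_to_jsonl(json_path, jsonl_path):
    """Convert a JSON array of objects into a JSON Lines file.

    Returns the number of rows written.
    """
    records = json.loads(Path(json_path).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError("expected a top-level JSON array of objects")
    with Path(jsonl_path).open("w", encoding="utf-8") as f:
        for record in records:
            # One compact JSON object per line
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)
```

For example, json_to_jsonl("/path/to/data/metadata.json", "/path/to/data/metadata.jsonl") writes the new file next to the audio files. You may want to move the old json file out of the folder afterwards.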

That’s great to know!