Hey there!
I am trying to create a custom audio dataset from local files: I have the audio files (mp3) and the corresponding metadata (json). I can't upload the data to the Hugging Face Hub because it's confidential. Does anyone know if there is a way of creating the audio dataset without uploading the data to the Hub?
Hi! You can use the AudioFolder builder to quickly create the dataset; just make sure you have the metadata file in the same directory (check out the docs here for more details!):
```python
from datasets import load_dataset

dataset = load_dataset("audiofolder", data_dir="/path/to/data")
```
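AudioFolder looks for a `metadata.jsonl` (or `metadata.csv`) file next to the audio files, and it must contain a `file_name` column pointing at each file. A sketch of the expected layout, with hypothetical file and column names:

```
/path/to/data/
├── metadata.jsonl
├── clip_0.mp3
└── clip_1.mp3
```

Each line of `metadata.jsonl` is one JSON object:

```json
{"file_name": "clip_0.mp3", "label": "positive"}
{"file_name": "clip_1.mp3", "label": "negative"}
```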
I did some checks yesterday and it sort of worked. I am currently at the training step, and training fails with a CUDA out-of-memory error whose message mentions PYTORCH_CUDA_ALLOC_CONF (which I am trying to resolve).
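What I am experimenting with so far, based on the error message (just a sketch; `max_split_size_mb` is one allocator knob among several, and lowering the batch size may be the simpler fix):

```python
import os

# Must be set before CUDA is initialized, so set it before the first torch CUDA call.
# max_split_size_mb limits allocator block splitting, which can reduce fragmentation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # imported after setting the variable
```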
Here is what I did:
Arranged my data in a local folder, with files named audio_0.wav, audio_1.wav, …, audio_N.wav.
Built the dataset from a dict; df_train was a pandas DataFrame in my case, with information like the label for each audio file:
```python
from datasets import Dataset, Audio

audio_dict = {}
audio_dict["audio"] = [f"path/to/audio_{i}.wav" for i in range(len(df_train))]  # audio_0.wav ... audio_N.wav
audio_dict["label"] = [x for x in df_train.label]
audio_dict["split"] = len(df_train) * ["train"]

# cast the path column to the Audio feature so each example decodes on access
audio_dataset = Dataset.from_dict(audio_dict).cast_column("audio", Audio())
# note: for now I am not using the split column anywhere in my code, it can be omitted
```
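For a quick sanity check: indexing into the dataset decodes the audio through the Audio feature, so (assuming the paths above actually exist) you can do something like:

```python
sample = audio_dataset[0]
print(sample["label"])
print(sample["audio"]["sampling_rate"])  # e.g. 44100
print(sample["audio"]["array"].shape)    # 1-D numpy array of samples
```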
Once I got my audio_dataset object, I replaced the "dataset" object with "audio_dataset" in this notebook: Google Colab
However, I did solve my initial issue: I had saved the metadata in a .json file, which is wrong. You have to save it as a .jsonl file (named metadata.jsonl, one JSON object per line). After I changed the file type, audiofolder worked for me.
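In case it saves someone a step, here is a minimal sketch of that conversion, assuming the original metadata.json held a list of records that each include a "file_name" key:

```python
import json

with open("metadata.json") as f:
    records = json.load(f)  # assumed: a list of dicts, each with a "file_name" key

with open("metadata.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")  # one JSON object per line
```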