Column type issue when pushing an ASR dataset using AudioFolder

Hi, I am trying to push an ASR dataset (file_name, transcription) to HF using the AudioFolder functionality. When the dataset is pushed, I expect the two column types to be audio and string. In this particular case, once the dataset is uploaded, the audio column is of type dict. This causes errors when processing the audio (resampling), as that requires an audio type, not a dict. Any idea why the audio column is of type dict?

Hi! Which dataset is it?

@cohogain hey, can you provide sample code to reproduce the bug, your data structure format, and maybe some audio samples?

Hello, please see this example, where I tried to upload an audio dataset using the audiofolder method and it only uploaded the audio column, not the transcription.

I previously had a dataset whose audio column was of type dict, alongside a transcription column, but I removed it. I will try to reproduce it now.

I am using the following code snippet to upload data:

from datasets import load_dataset

dataset = load_dataset("audiofolder", data_dir="multiple-ga-IE-v1/data/Fuaimeanna2/data/")

dataset.push_to_hub("cohogain/Fuaimeanna3")


The folder structure is as follows:
Fuaimeanna2/
Fuaimeanna2/data/*.wav
Fuaimeanna2/metadata.csv

metadata.csv is as follows:
[image: screenshot of metadata.csv]
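For reference, here is a minimal sketch of how a metadata.csv like this could be generated. It assumes the two columns named in the first post (file_name, transcription) and that file_name holds paths relative to the folder containing metadata.csv. The helper name write_metadata and the sample values in the usage note are hypothetical, not taken from the actual dataset.

```python
import csv
from pathlib import Path


def write_metadata(root, transcriptions):
    """Write root/metadata.csv with file_name and transcription columns.

    `transcriptions` maps paths relative to `root` (e.g. "data/a.wav")
    to their text. Hypothetical helper, not part of the datasets library.
    """
    root = Path(root)
    out = root / "metadata.csv"
    with out.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Header row: the file_name column is what links rows to audio files.
        writer.writerow(["file_name", "transcription"])
        for rel_path, text in sorted(transcriptions.items()):
            writer.writerow([rel_path, text])
    return out
```

A call such as write_metadata("Fuaimeanna2", {"data/a.wav": "some text"}) would place metadata.csv next to the data/ folder, matching the layout shown above.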

I seem to be unable to reproduce the result where the audio column was of type dict. It now uploads only the audio column, with type audio, and no transcription text column.

@cohogain thank you for the examples :) Please try loading your dataset with data_dir pointing to the directory that contains all the data, including the metadata file, that is, multiple-ga-IE-v1/data/Fuaimeanna2:

dataset = load_dataset("audiofolder", data_dir="multiple-ga-IE-v1/data/Fuaimeanna2")

Because your metadata file is located outside the data directory you passed to the loader in your example, the loader simply doesn't see it.
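A quick way to catch this before calling load_dataset is to check the directory yourself. The sketch below uses only the standard library; check_audiofolder_dir is a hypothetical helper (not part of datasets), and it assumes metadata.csv sits directly inside data_dir with a file_name column whose paths are relative to that file.

```python
import csv
from pathlib import Path


def check_audiofolder_dir(data_dir):
    """Return a list of problems found in an AudioFolder-style directory.

    Hypothetical pre-flight check. Assumes metadata.csv lives directly in
    data_dir and that its file_name column holds paths relative to it.
    """
    data_dir = Path(data_dir)
    metadata = data_dir / "metadata.csv"
    if not metadata.is_file():
        return [f"no metadata.csv directly inside {data_dir}"]
    problems = []
    with metadata.open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if "file_name" not in (reader.fieldnames or []):
            return ["metadata.csv has no file_name column"]
        for row in reader:
            # Each listed file must exist relative to the metadata file.
            if not (data_dir / row["file_name"]).is_file():
                problems.append(f"missing audio file: {row['file_name']}")
    return problems
```

With the layout from this thread, check_audiofolder_dir("multiple-ga-IE-v1/data/Fuaimeanna2/data/") would report the missing metadata file, whereas the parent directory should come back clean as long as every file_name entry resolves to an existing file.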

This worked, thank you :)
