Column type issue pushing ASR dataset using Audiofolders

cohogain · February 28, 2023, 7:27pm

Hi, I am trying to push an ASR dataset (file_name, transcription) to the HF Hub using the AudioFolder functionality. After the push, I expect the two columns to be of type audio and string. In this particular case, however, the audio column of the uploaded dataset is of type dict. This causes errors when processing the audio (resampling), since that requires an audio type, not a dict. Any ideas why the audio column is of type dict?

lhoestq · March 3, 2023, 3:23pm

Hi! Which dataset is it?

polinaeterna · March 6, 2023, 6:42pm

@cohogain hey, can you provide sample code to reproduce the bug, your data structure format, and maybe some audio samples?

cohogain · March 21, 2023, 6:25pm

Hello, please see this example, where I tried to upload an audio dataset using the audiofolder method and it uploaded only the audio column, not the transcription.

I previously had a dataset with an audio column of type dict and a transcription column, but I removed it. I will try to reproduce it now.

cohogain · March 21, 2023, 6:34pm

I am using the following code snippet to upload data:

from datasets import load_dataset

dataset = load_dataset("audiofolder", data_dir="multiple-ga-IE-v1/data/Fuaimeanna2/data/")

dataset.push_to_hub("cohogain/Fuaimeanna3")

The folder structure is as follows:
Fuaimeanna2/
Fuaimeanna2/data/*.wav
Fuaimeanna2/metadata.csv

metadata.csv is as follows:

cohogain · March 21, 2023, 6:36pm

I seem to be unable to reproduce the result where the audio column was of type dict. It now uploads only the audio column, as type audio, with no transcription text column.

polinaeterna · March 22, 2023, 12:04pm

@cohogain thank you for the examples. Please try to load your dataset with data_dir pointing to the directory that contains all the data, including the metadata file, that is, multiple-ga-IE-v1/data/Fuaimeanna2:

dataset = load_dataset("audiofolder", data_dir="multiple-ga-IE-v1/data/Fuaimeanna2")

As your metadata file is located outside of the data directory provided to the loader in your example, the loader just doesn't see it.
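
One way to double-check this before loading is to list the metadata files located at or below the directory you plan to pass as data_dir. The following is only a minimal sketch, not part of the datasets library: the helper name find_metadata_files and the constant METADATA_NAMES are hypothetical, and it assumes the metadata file is named metadata.csv or metadata.jsonl.

```python
from pathlib import Path

# Assumption for this sketch: only the CSV and JSON Lines metadata
# file names are checked.
METADATA_NAMES = ("metadata.csv", "metadata.jsonl")


def find_metadata_files(data_dir):
    """Return metadata files located at or below data_dir, sorted by path.

    Hypothetical helper: an empty result suggests the loader would not
    see any metadata for this data_dir.
    """
    root = Path(data_dir)
    # rglob("*") walks data_dir itself and every subdirectory.
    return sorted(p for p in root.rglob("*") if p.name in METADATA_NAMES)
```

With the folder structure from this thread, this helper should return an empty list for multiple-ga-IE-v1/data/Fuaimeanna2/data/ and the path to metadata.csv for multiple-ga-IE-v1/data/Fuaimeanna2.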

cohogain · March 30, 2023, 6:53pm

This worked, thank you.
