Misunderstanding around creating audio datasets from local files

Hi! Here is an example in Python:

```python
from datasets import Dataset, Audio

ds = Dataset.from_dict({
    "audio": ["path/to/audio_1", "path/to/audio_2", ..., "path/to/audio_n"],
    "transcription": ["First transcript", "Second transcript", ..., "Last transcript"],
}).cast_column("audio", Audio())
```
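If you don't want to type the paths by hand, you can collect them programmatically. A minimal sketch with the standard library, assuming a hypothetical folder of `.wav` files (the folder name `clips` and the helper name are illustrative, not from the original post):

```python
from pathlib import Path

def collect_audio_paths(folder):
    # Sort so the audio order is deterministic and lines up with the
    # transcription list you pair it with.
    return sorted(str(p) for p in Path(folder).glob("*.wav"))

# Hypothetical usage; returns [] if the folder does not exist.
paths = collect_audio_paths("clips")
```

The sorted list can then be passed as the `"audio"` column in `Dataset.from_dict`, with the transcriptions in the same order.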

Alternatively, you can define an AudioFolder (see the docs):

my_dataset/
β”œβ”€β”€ README.md
β”œβ”€β”€ metadata.csv
└── data/
    β”œβ”€β”€ audio_0.wav
    ...
    └── audio_n.wav
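The `metadata.csv` needs a `file_name` column with paths relative to the metadata file's location; the other columns become dataset columns. A sketch matching the layout above (the transcript values are illustrative):

```csv
file_name,transcription
data/audio_0.wav,First transcript
data/audio_n.wav,Last transcript
```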

Also, if I want to have two separate datasets, one for testing and one for training, what's the approach to follow? Upload everything and tag the split in the metadata.csv, or create two folders and upload the audio snippets/transcriptions into each?

You can structure your AudioFolder like this:

my_dataset/
β”œβ”€β”€ README.md
β”œβ”€β”€ metadata.csv
β”œβ”€β”€ test/
β”‚   β”œβ”€β”€ audio_0.wav
β”‚   ...
β”‚   └── audio_n.wav
└── train/
    β”œβ”€β”€ audio_0.wav
    ...
    └── audio_n.wav

It’s also possible to have one metadata.csv in train/ and one in test/ if you prefer keeping each split's metadata next to its files.
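A minimal sketch, using only the standard library, of writing one metadata.csv per split (folder, file, and transcript names are illustrative; the dataset can then be loaded with `load_dataset("audiofolder", data_dir="my_dataset")`, which detects the train/test splits from the folder names):

```python
import csv
from pathlib import Path

def write_split(root, split, rows):
    # Create <root>/<split>/metadata.csv listing (file_name, transcription) rows.
    split_dir = Path(root) / split
    split_dir.mkdir(parents=True, exist_ok=True)
    with open(split_dir / "metadata.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name", "transcription"])  # required header
        writer.writerows(rows)

# Hypothetical contents; the audio files themselves go next to each metadata.csv.
write_split("my_dataset", "train", [("audio_0.wav", "First transcript")])
write_split("my_dataset", "test", [("audio_1.wav", "Second transcript")])
```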
