Misunderstanding around creating audio datasets from local files

Hi, the documentation only explains how to add audio files, but I want to add audio files together with their transcriptions.

How can I do that so I can build a dataset of snippet/transcription pairs that I can train on?

Also, if I want to have two separate datasets, one for testing and one for training, what’s the approach to follow? Upload everything and tag the split in metadata.csv, or create two folders and upload the snippets/transcriptions into them?

Hi! Here is an example in Python:

from datasets import Audio, Dataset

ds = Dataset.from_dict({
    "audio": ["path/to/audio_1", "path/to/audio_2", ..., "path/to/audio_n"],
    "transcription": ["First transcript", "Second transcript", ..., "Last transcript"],
}).cast_column("audio", Audio())
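In practice the two parallel lists usually come from files on disk. Here is a minimal, runnable sketch that pairs each .wav with a same-named .txt transcript; the folder layout and file names are hypothetical, and a throwaway directory with dummy files stands in for your real data:

```python
import tempfile
from pathlib import Path

# Throwaway folder with dummy audio/transcript pairs so the sketch runs;
# point this at your own files in practice.
root = Path(tempfile.mkdtemp())
for i, text in enumerate(["First transcript", "Second transcript"]):
    (root / f"audio_{i}.wav").write_bytes(b"")   # placeholder audio bytes
    (root / f"audio_{i}.txt").write_text(text)   # sidecar transcript

# Sort so audio paths and transcripts stay aligned.
audio_paths = sorted(str(p) for p in root.glob("*.wav"))
transcriptions = [Path(p).with_suffix(".txt").read_text() for p in audio_paths]

data = {"audio": audio_paths, "transcription": transcriptions}
# Then: Dataset.from_dict(data).cast_column("audio", Audio())
```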

Alternatively, you can define an AudioFolder (see the docs):

my_dataset/
├── README.md
├── metadata.csv
└── data/
    ├── audio_0.wav
    ...
    └── audio_n.wav
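The metadata.csv needs a file_name column with paths relative to the metadata file, plus a column for your transcriptions. A sketch with hypothetical entries:

```
file_name,transcription
data/audio_0.wav,First transcript
data/audio_n.wav,Last transcript
```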

Also, if I want to have two separate datasets, one for testing and one for training, what’s the approach to follow? Upload everything and tag the split in metadata.csv, or create two folders and upload the snippets/transcriptions into them?

You can structure your AudioFolder like this:

my_dataset/
├── README.md
├── metadata.csv
├── test/
│   ├── audio_0.wav
│   ...
│   └── audio_n.wav
└── train/
    ├── audio_0.wav
    ...
    └── audio_n.wav

It’s also possible to have one metadata.csv in train/ and one in test/ instead of a single top-level one, if you prefer.
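To make the per-split variant concrete, here is a runnable sketch that builds the layout above with one metadata.csv inside each split folder; the file names and transcripts are hypothetical placeholders, and empty files stand in for real audio:

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical split contents: (audio file name, transcription) pairs.
splits = {
    "train": [("audio_0.wav", "First transcript")],
    "test": [("audio_0.wav", "Held-out transcript")],
}

root = Path(tempfile.mkdtemp()) / "my_dataset"
for split, rows in splits.items():
    split_dir = root / split
    split_dir.mkdir(parents=True)
    for name, _ in rows:
        (split_dir / name).write_bytes(b"")  # placeholder audio bytes
    with open(split_dir / "metadata.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["file_name", "transcription"])
        for name, text in rows:
            # file_name is relative to this split's metadata.csv
            writer.writerow([name, text])

# load_dataset("audiofolder", data_dir=str(root)) would then expose
# "train" and "test" splits.
```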
