How to clip audio files in an audio dataset?

FeryET · May 31, 2022, 1:27pm

Hi.

I’m trying to use the common_voice dataset, but I want to keep the audio files to a maximum of 5 seconds. How can I achieve that?

lhoestq · June 1, 2022, 2:22pm

Hi! You can use filter to keep only the files that are shorter than 5 seconds:

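from datasets import load_dataset

def is_short(example, max_length_in_seconds=5):
    # Accessing "audio" decodes the file into an array plus its sampling rate
    arr = example["audio"]["array"]
    sampling_rate = example["audio"]["sampling_rate"]
    # Duration in seconds = number of samples / samples per second
    length_in_seconds = arr.shape[0] / sampling_rate
    return length_in_seconds < max_length_in_seconds

ds = load_dataset("common_voice", "ab", split="train")
# Keep only the examples for which is_short returns True
ds = ds.filter(is_short)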
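
Note that filter drops the longer recordings entirely. If the goal is instead to keep every file but cut it down to its first 5 seconds, a rough sketch could look like the following. The helper name `clip_audio` is hypothetical (it is not part of the datasets library), and the sketch assumes each example exposes `example["audio"]["array"]` and `example["audio"]["sampling_rate"]` as in the snippet above:

```python
def clip_audio(example, max_length_in_seconds=5):
    # Hypothetical helper, not a datasets API: truncate one example's audio.
    audio = example["audio"]
    sampling_rate = audio["sampling_rate"]
    # Number of samples that fit into the allowed duration
    max_samples = int(max_length_in_seconds * sampling_rate)
    # Slicing leaves shorter recordings untouched and works on lists and NumPy arrays
    clipped = audio["array"][:max_samples]
    return {"audio": {"array": clipped, "sampling_rate": sampling_rate}}
```

It would presumably be applied with `ds = ds.map(clip_audio)`. How the returned dictionary is written back into the audio column depends on the installed datasets version, so it is worth checking one example after mapping.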
