How to load only test dataset from `librispeech_asr`?

IlllIIII · December 7, 2021, 4:11am

Hi, I want to evaluate my model on the LibriSpeech dataset.

	Train.500	Train.360	Train.100	Valid	Test
clean	-	104014	28539	2703	2620
other	148688	-	-	2864	2939

I don’t need the train.500, train.360, train.100, or valid sets. Is there any way to load only the test set from librispeech_asr?

I could load everything and keep only the test set afterwards, but loading all the data takes too long :(( So I want to load only the test set.

from datasets import load_dataset, DatasetDict

# It takes forever to load everything here... 
libre_dataset = load_dataset("librispeech_asr", 'clean')

keep = ["test"]
libre_dataset = DatasetDict({k: dataset for k, dataset in libre_dataset.items() if k in keep})

Thanks for reading!

lewtun · December 7, 2021, 8:36am

Hey @IlllIIII, you can specify the split you want to load by passing split="test" to the load_dataset() function (docs). This will still download all the splits in the dataset, so if disk space is an issue you can stream the elements of the test set one by one instead:

from datasets import load_dataset

dset = load_dataset("librispeech_asr", 'clean', split="test", streaming=True)
next(iter(dset))
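
If you want more than the first element, here is a minimal sketch of pulling a handful of streamed examples. It assumes the streaming dataset can be iterated like any other Python iterable, which is what next(iter(dset)) above relies on. The helper names take_first and preview_test_split are made up for illustration, and the stand-in stream at the bottom is fake data, not LibriSpeech.

```python
from itertools import islice


def take_first(stream, n):
    """Return the first n items of any iterable, reading no further than that."""
    return list(islice(stream, n))


def preview_test_split(n=3):
    """Hypothetical helper: stream n examples from the LibriSpeech clean test split.

    Needs the `datasets` library and network access, so the import is kept
    local and the function is not called in this sketch.
    """
    from datasets import load_dataset

    dset = load_dataset("librispeech_asr", "clean", split="test", streaming=True)
    return take_first(dset, n)


# Stand-in stream (not real LibriSpeech data) to show how the helper behaves.
fake_stream = ({"id": i} for i in range(1000))
first_three = take_first(fake_stream, 3)
print(first_three)
```

With a real streaming dataset you would call preview_test_split(3) instead; only the examples you actually iterate over should be fetched.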

IlllIIII · December 7, 2021, 4:11pm

Hey, thanks for the tips! It was super useful
