How to load only test dataset from `librispeech_asr`?

Hi, I want to evaluate my model on the LibriSpeech dataset :slight_smile:

| | Train.500 | Train.360 | Train.100 | Valid | Test |
|---|---|---|---|---|---|
| clean | - | 104014 | 28539 | 2703 | 2620 |
| other | 148688 | - | - | 2864 | 2939 |

I don’t need the train.500, train.360, train.100 or valid sets. Is there any way to load only the test set from librispeech_asr?

I could load all the data and keep only the test set afterwards, but loading everything takes too long :(( So I want to load only the test set.

from datasets import load_dataset, DatasetDict

# It takes forever to load everything here... 
libre_dataset = load_dataset("librispeech_asr", 'clean')

keep = ["test"]
libre_dataset = DatasetDict({k: dataset for k, dataset in libre_dataset.items() if k in keep})

Thanks for reading!

Hey @IlllIIII, you can load a single split by passing split="test" to the load_dataset() function (docs). This will still download all the splits in the dataset, so if disk space is an issue you can instead stream the examples of the test set one by one:

from datasets import load_dataset

dset = load_dataset("librispeech_asr", "clean", split="test", streaming=True)
next(iter(dset))
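
If you want more than one example, for instance a handful to sanity-check an evaluation loop, here is a minimal sketch built on the same streaming call. The helper names take_first and stream_test_split are hypothetical (they are not part of the datasets library), and actually iterating over the stream assumes network access and a working audio decoding backend:

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def take_first(stream: Iterable[Dict[str, Any]], n: int) -> List[Dict[str, Any]]:
    """Return at most the first n examples; the rest of the stream is not consumed."""
    return list(islice(stream, n))


def stream_test_split() -> Iterator[Dict[str, Any]]:
    """Hypothetical helper: lazily iterate over the 'clean' test split."""
    # Imported here so that take_first can be used without `datasets` installed.
    from datasets import load_dataset

    dset = load_dataset("librispeech_asr", "clean", split="test", streaming=True)
    return iter(dset)
```

Calling take_first(stream_test_split(), 3) should then pull only three examples from the test split instead of materializing the whole thing.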

Hey, thanks for the tips! It was super useful :slight_smile:
