Loading and saving the German Train.1h split of the multilingual_librispeech (MLS) dataset

I want to download just the German Train.1h split of the multilingual_librispeech (MLS) dataset and save it to my local machine. After some trial and error, I ended up using the following command:

from datasets import load_dataset

mls_german_train_1h_dataset = load_dataset('multilingual_librispeech', 'german', split='Train.1h', cache_dir='/path/to/mls-german-train-1h')

However, when I started running it, my terminal told me I was downloading 123 GB of data. I thought, “Can that be right? Is the smallest training split of the MLS German dataset really that big?” Not wanting to download that much data, I interrupted the command.

For comparison, I then started downloading the German Train.9h split instead:

mls_german_train_9h_dataset = load_dataset('multilingual_librispeech', 'german', split='Train.9h', cache_dir='/mnt/nvdl/usr/schilton/mls-german-train-9h')

Once again, the terminal told me that I was downloading 123 GB.

Have I done something wrong in my calls to load_dataset(), or will Python sift through all 123 GB of data and only save the specified split?

Hi! The data for both train.1h and train.9h comes from the same folder, hence the equal download size. This dataset doesn’t currently support streaming, so downloading all of the data is the only option.