Loading and saving the German Train.1h split of the multilingual_librispeech (MLS) dataset

I want to download just the German Train.1h split of the multilingual_librispeech (MLS) dataset and save it to my local machine. After some trial and error, I ended up using the following command:

from datasets import load_dataset

mls_german_train_1h_dataset = load_dataset('multilingual_librispeech', 'german', split='Train.1h', cache_dir='/path/to/mls-german-train-1h')

However, when I started running it, my terminal told me I was downloading 123 GB of data. I thought, “Can that be right? Is the smallest training split of the MLS German dataset really that big?” Not wanting to download that much data, I interrupted the command.

For comparison, I then started downloading the German Train.9h split instead:

mls_german_train_9h_dataset = load_dataset('multilingual_librispeech', 'german', split='Train.9h', cache_dir='/mnt/nvdl/usr/schilton/mls-german-train-9h')

Once again, the terminal told me that I was downloading 123 GB.

Have I done something wrong in my calls to load_dataset(), or will Python sift through all 123 GB of data and only save the specified split?

Hi! The data for both train.1h and train.9h comes from the same folder, hence the equal download size. This dataset doesn’t currently support streaming, so downloading all of the data is the only option.