I want to download just the German Train.1h split of the multilingual_librispeech (MLS) dataset and save it to my local machine. After some trial and error, I ended up using the following command:
from datasets import load_dataset

mls_german_train_1h_dataset = load_dataset('multilingual_librispeech', 'german', split='Train.1h', cache_dir='/path/to/mls-german-train-1h')
However, when I ran it, the terminal reported that it was downloading 123 GB of data. I thought, “Can that be right? Is the smallest training split of the MLS German dataset really that big?” Not wanting to pull down that much data, I interrupted the command.
For comparison, I then started downloading the German Train.9h split instead:
mls_german_train_9h_dataset = load_dataset('multilingual_librispeech', 'german', split='Train.9h', cache_dir='/mnt/nvdl/usr/schilton/mls-german-train-9h')
Once again, the terminal told me that I was downloading 123 GB.
Have I done something wrong in my calls to load_dataset(), or will the datasets library download all 123 GB and then keep only the split I asked for?