I want to download just the German Train.1h split of the multilingual_librispeech (MLS) dataset and save it to my local machine. After some trial and error, I ended up using the following command:
from datasets import load_dataset

mls_german_train_1h_dataset = load_dataset('multilingual_librispeech', 'german', split='Train.1h', cache_dir='/path/to/mls-german-train-1h')
However, when I ran it, the terminal reported that I was downloading 123 GB of data. I thought, “Can that be right? Is the smallest training split of the MLS German dataset really that big?” Not wanting to download so much data, I interrupted the command.
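To sanity-check the size without actually downloading anything, I also tried inspecting the builder metadata. This is a minimal sketch; I'm assuming the MLS loading script actually populates download_size and the per-split num_bytes fields, which not every dataset script does:

from datasets import load_dataset_builder

# Fetches only the dataset script/metadata, not the audio archives
builder = load_dataset_builder('multilingual_librispeech', 'german')
print(builder.info.download_size)   # total size of the source archives, if reported
for name, info in builder.info.splits.items():
    print(name, info.num_bytes)     # per-split size on disk, if reported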
For comparison, I then started downloading the German Train.9h split instead:
mls_german_train_9h_dataset = load_dataset('multilingual_librispeech', 'german', split='Train.9h', cache_dir='/mnt/nvdl/usr/schilton/mls-german-train-9h')
Once again, the terminal told me that I was downloading 123 GB.
Have I done something wrong in my calls to load_dataset(), or will Python sift through all 123 GB of data and only save the specified split?
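For what it's worth, a workaround I'm considering is streaming mode, which, as I understand it, fetches samples on the fly instead of downloading the archives up front. This is a minimal sketch reusing the same split string as above; I haven't verified that this particular dataset supports streaming:

from datasets import load_dataset

# streaming=True returns an IterableDataset; nothing is downloaded up front
mls_stream = load_dataset('multilingual_librispeech', 'german', split='Train.1h', streaming=True)
for sample in mls_stream.take(5):  # pull just the first few samples
    print(sample)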