To whom it may concern,
I am currently working on an ASR project with several different datasets, including the ESB and MLS datasets. However, I have reached the maximum disk quota on the server I am working on. Since my final goal is to evaluate my ASR model on these dataset groups, I should be able to bypass the problem using streaming. Still, I am quite interested in how the cache is managed by the Datasets library. If I'm correct, the official Datasets documentation never mentions how to manage/delete an existing cached dataset, while the Hugging Face Hub documentation is much more detailed on this point.
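For context, this is the streaming workaround I have in mind (a minimal sketch; I am assuming that streaming=True avoids writing anything to the cache at all):

from datasets import load_dataset

# Stream the test split instead of materializing it on disk (my assumption)
librispeech_stream = load_dataset(path="librispeech_asr",
                                  name="clean",
                                  split="test",
                                  streaming=True)
sample = next(iter(librispeech_stream))  # fetches a single example over the network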
Example 1 with librispeech_asr:
Using the du command, I can see that the huggingface/datasets/librispeech_asr directory is about 300 GB. My guess is therefore that the WAV files are stored inside it.
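To check this guess, I printed the path of one decoded sample and the Arrow cache files backing the split (since the dataset is already cached, I assume this does not trigger a new download):

from datasets import load_dataset

librispeech_en_clean = load_dataset(path="librispeech_asr", name="clean", split="test")
# The Audio feature exposes the path of the underlying audio file
print(librispeech_en_clean[0]["audio"]["path"])
# The Arrow files backing this split, for comparison
print(librispeech_en_clean.cache_files)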
Example 2 with facebook___multilingual_librispeech:
This time, the du command shows that the huggingface/datasets/facebook___multilingual_librispeech directory only takes up a few MB. After some quick investigation, I found that the audio files are stored in huggingface/datasets/downloads and possibly in huggingface/datasets/downloads/extracted.
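For completeness, this is the kind of quick investigation I mean, counting file extensions under both directories (a small sketch; I assume the cache lives at the default location, so adjust cache_dir as needed):

import os
from collections import Counter

cache_dir = os.path.expanduser("~/.cache/huggingface/datasets")

for sub in ("downloads", os.path.join("downloads", "extracted")):
    exts = Counter()
    for root, _dirs, files in os.walk(os.path.join(cache_dir, sub)):
        for f in files:
            exts[os.path.splitext(f)[1] or "<no extension>"] += 1
    # Show which file types live in each directory
    print(sub, dict(exts))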
My questions:
- How can I find the file paths of the audio files for a given ASR dataset?
- How can I delete one dataset in particular, that is, without having to delete the whole cache? (See the sketch after this list for my current workaround.)
- What's the difference between huggingface/datasets/downloads and huggingface/datasets/downloads/extracted?
- On a slightly different topic: is there a way to cache only a given split of a dataset? If I'm correct, using the snippet in Code 1 will download the full dataset. Although I could use streaming, caching only part of the dataset would greatly speed up my iterations.
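My current workaround for the second question would be something like the sketch below, but I am not sure it is safe, precisely because it does not touch the files under downloads/ (which is why I am asking):

import os
import shutil

cache_dir = os.path.expanduser("~/.cache/huggingface/datasets")

# Remove only the cache directory of one dataset; the archives under
# downloads/ and downloads/extracted are presumably left behind
shutil.rmtree(os.path.join(cache_dir, "librispeech_asr"), ignore_errors=True)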
Code 1:
from datasets import load_dataset

librispeech_en_clean = load_dataset(path="librispeech_asr",
                                    name="clean",
                                    split="test")
Thank you very much in advance for your time.
Yours sincerely,
Tony