Understanding the `Datasets` cache system

To whom it may concern,

I am currently working on an ASR project with several datasets, including the ESB and MLS datasets. However, I have reached the storage quota on the server I’m working with. Since my end goal is to evaluate my ASR model on these dataset groups, I should be able to work around the problem with streaming, but I am still quite interested in how the cache is managed by the Datasets library. If I’m not mistaken, the official Datasets documentation never explains how to manage or delete an existing cached dataset, whereas the Hugging Face Hub documentation covers this in much more detail.

Example 1 with librispeech_asr:

Using the du command, I can see that the huggingface/datasets/librispeech_asr directory takes up about 300 GB, so my guess is that the WAV files are stored inside it.

Example 2 with facebook___multilingual_librispeech:

This time, the du command shows that the huggingface/datasets/facebook___multilingual_librispeech directory is only a few MB. After some quick investigation, I found that the audio files are stored in huggingface/datasets/downloads and possibly in huggingface/datasets/downloads/extracted.

My questions:

  1. How can I find the file paths of the audio files for a given ASR dataset?
  2. How can I delete one dataset in particular (that is, without having to delete the whole cache)?
  3. What’s the difference between huggingface/datasets/downloads and huggingface/datasets/downloads/extracted?
  4. On a slightly different topic: is there a way to cache only a given split of a dataset? If I’m not mistaken, the snippet in Code 1 downloads the full dataset. I could fall back to streaming, but caching only the split I need would greatly speed up my iterations.

Code 1:

from datasets import load_dataset

librispeech_en_clean = load_dataset(path="librispeech_asr",
                                    name="clean",
                                    split="test")

Thank you very much in advance for your time.

Yours sincerely,
Tony

Hi!

Some dataset scripts generate Arrow files with the external files (image/audio) embedded in them, and some don’t, hence the discrepancy. In Datasets 3.0, we will always embed these bytes.

  1. By disabling decoding of the audio column and fetching the file paths as follows:
    ds_raw_audio = ds.cast_column("audio", datasets.Audio(decode=False))
    ds_with_audio_paths = ds_raw_audio.map(
        lambda batch: {"audio_path": [audio_dict["path"] for audio_dict in batch["audio"]]},
        batched=True,
        remove_columns=ds.column_names,
    )
    audio_paths = ds_with_audio_paths.unique("audio_path")
    
  2. It’s tricky to do this with the current cache structure. The best approach is to pass cache_dir to load_dataset (one cache directory per dataset) and delete that directory later; see the sketch after this list. We aim to align the cache structure with huggingface_hub to make deleting/inspecting it simpler.
  3. Archives downloaded to huggingface/datasets/downloads are extracted into huggingface/datasets/downloads/extracted.
  4. No, due to the script structure we inherited from TensorFlow Datasets. We need to introduce a special script structure to allow this (while preserving backward compatibility).
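
To illustrate point 2, here is a minimal sketch of the per-dataset cache_dir approach; the directory name below is just an example:

import shutil

from datasets import load_dataset

# Example: a dedicated cache directory for this dataset only
cache_dir = "./hf_cache/librispeech_asr_clean"

ds = load_dataset("librispeech_asr", "clean", split="test", cache_dir=cache_dir)

# ... evaluate your model on ds ...

# Deleting the directory removes only this dataset's cached files
shutil.rmtree(cache_dir)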

Thank you so much, everything’s perfectly clear!