Understanding the `Datasets` cache system

To whom it may concern,

I am currently working on an ASR project with several different datasets, including the ESB and MLS datasets. However, I have reached the storage quota on the server I am working on. My final goal is to evaluate my ASR model on these dataset groups, so I should be able to work around the problem by using streaming. Still, I am quite interested in how the cache is managed by the Datasets library. If I'm not mistaken, the official Datasets documentation never explains how to manage or delete an existing cached dataset, whereas the Hugging Face Hub documentation is much more detailed on this point.

Example 1 with librispeech_asr:

Using the du command, I can see that the huggingface/datasets/librispeech_asr directory is about 300 GB, so my guess is that the WAV files are stored inside it.

Example 2 with facebook___multilingual_librispeech:

This time, the du command shows that the huggingface/datasets/facebook___multilingual_librispeech directory is only a few MB. After some quick investigation, I found that the audio files are stored in huggingface/datasets/downloads and possibly in huggingface/datasets/downloads/extracted.
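
For reference, a rough Python equivalent of the du checks above, assuming the default cache location under ~/.cache/huggingface/datasets (the path differs if HF_DATASETS_CACHE or HF_HOME is set):

from pathlib import Path

# Default Datasets cache location; adjust if HF_DATASETS_CACHE / HF_HOME is set.
cache_root = Path.home() / ".cache" / "huggingface" / "datasets"

# Report the size of each top-level cache entry, similar to `du -sh`.
for entry in sorted(cache_root.iterdir()):
    if entry.is_file():
        size_bytes = entry.stat().st_size
    else:
        size_bytes = sum(f.stat().st_size for f in entry.rglob("*") if f.is_file())
    print(f"{entry.name}: {size_bytes / 1e9:.1f} GB")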

My questions:

  1. How can I find the audio file paths for a given ASR dataset?
  2. How can I delete one dataset in particular (that is without having to delete the whole cache)?
  3. What’s the difference between huggingface/datasets/downloads and huggingface/datasets/downloads/extracted?
  4. On a slightly different topic: is there a way to cache only a given split of a dataset? If I'm not mistaken, the snippet in Code 1 downloads the full dataset. Although I could use streaming, caching only the part of the dataset I need would greatly speed up my iterations.

Code 1:

from datasets import load_dataset

librispeech_en_clean = load_dataset(path="librispeech_asr",
                                    name="clean",
                                    split="test")
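
For context, the streaming workaround I have in mind would look roughly like the snippet below, if I understand the API correctly: passing streaming=True returns an IterableDataset that fetches examples lazily instead of writing the whole split to the cache.

from datasets import load_dataset

# Streaming: the split is not written to the local cache, examples are fetched lazily.
librispeech_en_clean_streamed = load_dataset(path="librispeech_asr",
                                             name="clean",
                                             split="test",
                                             streaming=True)

# Inspect a few examples without downloading the full split.
for example in librispeech_en_clean_streamed.take(5):
    print(example["text"])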

Thank you very much in advance for your time.

Yours sincerely,
Tony

Hi!

Some dataset scripts generate Arrow files with the external files (image/audio) embedded, and some don't, hence the discrepancy. In Datasets 3.0, we will always embed these bytes.

  1. By disabling decoding of the audio column and fetching the file paths as follows:
    import datasets  # assumes a Dataset `ds` is already loaded
    # With decoding disabled, the "audio" column returns {"path", "bytes"} dicts
    ds_raw_audio = ds.cast_column("audio", datasets.Audio(decode=False))
    ds_with_audio_paths = ds_raw_audio.map(
        lambda batch: {"audio_path": [audio_dict["path"] for audio_dict in batch["audio"]]},
        batched=True, remove_columns=ds.column_names)
    audio_paths = ds_with_audio_paths.unique("audio_path")
    
  2. It’s tricky to do this with the current cache structure. It’s best to pass cache_dir to load_dataset (one cache directory per dataset) and delete that directory later, as sketched after this list. We aim to align the cache structure with huggingface_hub to make deleting/inspecting it simpler.
  3. Archives downloaded to huggingface/datasets/downloads are extracted into huggingface/datasets/downloads/extracted.
  4. No, due to the script structure we inherited from TensorFlow Datasets. We need to introduce a special script structure to allow this (while preserving backward compatibility).
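
To illustrate point 2, a minimal sketch (the cache directory path below is just an example):

from datasets import load_dataset
import shutil

# Dedicated cache directory for this dataset (example path, adjust as needed).
cache_dir = "/path/to/dedicated_cache/librispeech_asr"

librispeech_en_clean = load_dataset(path="librispeech_asr",
                                    name="clean",
                                    split="test",
                                    cache_dir=cache_dir)

# ... use the dataset ...

# Deleting this single dataset later amounts to removing its dedicated directory.
shutil.rmtree(cache_dir)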

Thank you so much, everything’s perfectly clear!