Optimizing Disk Usage for Large (Audio) Datasets

I’m part of the BirdSet team, and we’ve identified an issue with our current Builder script.

Some of the audio datasets we work with are quite large, and we aim to provide access to individual audio files. To achieve this, we first download the archive file, extract its contents, and then generate the dataset. The reason for accessing the audio files directly is that we don’t need to load the entire audio file but only specific parts, which is possible using the soundfile library. This approach improves audio decoding efficiency and, consequently, reduces training time. However, we still want to provide access to the full audio files for other use cases.
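
For context, the partial decoding with soundfile looks roughly like this (the file name, sample rate, and segment boundaries below are placeholders, not values from our pipeline):

```python
import soundfile as sf

# Hypothetical recording and sample rate, just to illustrate partial decoding.
sample_rate = 32_000
start, stop = 5 * sample_rate, 10 * sample_rate  # decode only seconds 5-10

# soundfile seeks to `start` and stops at `stop`, so only this slice is read
# from disk instead of the whole file.
segment, sr = sf.read("example_recording.flac", start=start, stop=stop)
print(segment.shape, sr)
```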

Problem: We noticed that this approach requires more than double the amount of disk space compared to the actual size of the files, as both the archives and the extracted audio files need to be stored simultaneously.

We’ve implemented a workaround that achieves the same functionality, but it feels unintuitive and somewhat hacky. You can view this workaround here.
Essentially, we no longer extract the archives. During the generation of the Arrow files we load the audio bytes directly into the Arrow files and delete each archive as soon as it is no longer needed. This way, at any given moment we only need about half the disk space of the current approach.
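
Roughly, the workaround does the following inside _generate_examples (this is only a simplified sketch of the idea with illustrative names, not the exact code from the linked builder):

```python
import os

def _generate_examples(self, archives):
    # `archives` is assumed to be a list of
    # (archive_path, dl_manager.iter_archive(archive_path)) pairs from _split_generators.
    key = 0
    for archive_path, archive_iter in archives:
        # iter_archive yields (path_in_archive, file_object) without extracting to disk
        for rel_path, file_obj in archive_iter:
            yield key, {
                # the Audio() feature accepts raw bytes, so no extracted copy is needed
                "audio": {"path": rel_path, "bytes": file_obj.read()},
            }
            key += 1
        # the audio now lives in the Arrow shards, so the archive can be deleted
        os.remove(archive_path)
```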

Is this a reasonable way to handle the problem, or are there alternative approaches we might not be aware of?


I’m not familiar with the datasets library, but I wonder if iter_archive could be used?

Any suggestion is welcome.

Good catch, but we are already using this, for example.

I was testing earlier whether I could extract an archive and immediately delete it, but the _generate_examples function iterates through the archive’s contents, so this doesn’t work with our current implementation.
iter_archive seems to be the reason this doesn’t work. With a clever rewrite this approach could possibly work as well, but it seems fairly far-fetched and not aligned with other builder scripts in the audio domain.


I see. Then in this case you would be hard pressed to do without another library to manipulate tar.gz, but maybe you don’t want to add more dependencies…
If you’re going to do it with just the standard Python library, datasets, and soundfile, I do think it’s going to be hacky…
And you probably don’t want to prepare the dataset itself beforehand and put it on HF, either.

Hi ! If you load_dataset() an AudioFolder-formatted dataset, it won’t double the storage (the Arrow table of the dataset will just contain links to the audio files on your disk).
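
For reference, the AudioFolder route would look roughly like this (the directory path is a placeholder):

```python
from datasets import load_dataset

# Point load_dataset at a local folder of already-extracted audio files.
dataset = load_dataset("audiofolder", data_dir="/path/to/extracted_audio")

# The Arrow table stores references to the files; decoding happens on access.
print(dataset["train"][0]["audio"]["sampling_rate"])
```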


Hi! @lhoestq

I believe this is what I am currently doing in the dataset builder here.
Instead of loading the audio with audio.read(), which loads the bytes, I’m just passing the filepath string to the datasets.Audio() column.
My current problem is that we uploaded the audio files as tar.gz archives, which have to be extracted before a filepath string can be passed. Hence we extract the archive files, and we end up storing both the downloaded tar.gz archives and the extracted audio, which requires double the amount of storage.
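
In code, the current approach is roughly this (the file extension and directory layout are illustrative):

```python
from pathlib import Path

def _generate_examples(self, extracted_dir):
    # Yield only the filepath string; the Audio() column keeps a reference
    # to the file on disk instead of embedding the decoded bytes.
    for key, filepath in enumerate(sorted(Path(extracted_dir).glob("**/*.ogg"))):
        yield key, {"audio": str(filepath)}
```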

Currently I am looking at how minds14 handles its extraction and _generate_examples function. There it should be possible to sequentially extract the tar.gz files and delete each one right after extraction, as in the sketch below.
Would this be a good alternative? Or is there an easier way using AudioFolder?
(Note: we have a large number of individual audio files.)
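
A rough sketch of that per-archive extract-and-delete idea (not the actual minds14 code; paths and naming are illustrative):

```python
import os
import tarfile

def extract_and_cleanup(archive_paths, target_dir):
    extracted_dirs = []
    for archive_path in archive_paths:
        out_dir = os.path.join(
            target_dir, os.path.basename(archive_path).replace(".tar.gz", "")
        )
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(out_dir)  # extract this archive only
        os.remove(archive_path)      # free the archive's disk space right away
        extracted_dirs.append(out_dir)
    # at any point, at most one archive sits next to the extracted audio
    return extracted_dirs
```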


I’d recommend trying AudioFolder or streaming WebDataset, which are already well optimized and don’t duplicate the data locally.
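
For example, if the audio were repacked into .tar shards, streaming WebDataset could look like this (the shard pattern is a placeholder):

```python
from datasets import load_dataset

dataset = load_dataset(
    "webdataset",
    data_files={"train": "data/train-*.tar"},  # hypothetical local shard pattern
    split="train",
    streaming=True,  # samples are read directly from the shards, nothing is extracted
)

for example in dataset.take(2):
    print(example.keys())
```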
