Optimizing Disk Usage for Large (Audio) Datasets

I’m part of the BirdSet team, and we’ve identified an issue with our current Builder script.

Some of the audio datasets we work with are quite large, and we aim to provide access to individual audio files. To achieve this, we first download the archive file, extract its contents, and then generate the dataset. The reason for accessing the audio files directly is that we don’t need to load the entire audio file but only specific parts, which is possible using the soundfile library. This approach improves audio decoding efficiency and, consequently, reduces training time. However, we still want to provide access to the full audio files for other use cases.
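
For context, the partial decoding with soundfile looks roughly like this (the file name, sample rate, and segment boundaries below are placeholders, not values from our pipeline):

```python
import soundfile as sf

# Hypothetical recording and sample rate, just to illustrate partial decoding.
sample_rate = 32_000
start, stop = 5 * sample_rate, 10 * sample_rate  # decode only seconds 5-10

# soundfile seeks to `start` and stops at `stop`, so only this slice is read
# from disk instead of the whole file.
segment, sr = sf.read("example_recording.flac", start=start, stop=stop)
print(segment.shape, sr)
```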

Problem: We noticed that this approach requires more than double the amount of disk space compared to the actual size of the files, as both the archives and the extracted audio files need to be stored simultaneously.

We’ve implemented a workaround that achieves the same functionality, but it feels unintuitive and somewhat hacky. You can view this workaround here.
Essentially, we no longer extract the archives. During the generation of the Arrow files we load the audio bytes directly into the Arrow files and delete each archive as soon as it is no longer needed. This way, at any given moment we only need about half the disk space of the current approach.
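
Roughly, the workaround does the following inside _generate_examples (this is only a simplified sketch of the idea with illustrative names, not the exact code from the linked builder):

```python
import os

def _generate_examples(self, archives):
    # `archives` is assumed to be a list of
    # (archive_path, dl_manager.iter_archive(archive_path)) pairs from _split_generators.
    key = 0
    for archive_path, archive_iter in archives:
        # iter_archive yields (path_in_archive, file_object) without extracting to disk
        for rel_path, file_obj in archive_iter:
            yield key, {
                # the Audio() feature accepts raw bytes, so no extracted copy is needed
                "audio": {"path": rel_path, "bytes": file_obj.read()},
            }
            key += 1
        # the audio now lives in the Arrow shards, so the archive can be deleted
        os.remove(archive_path)
```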

Is this a reasonable way to handle the problem, or are there alternative approaches we might not be aware of?


I’m not familiar with the datasets library, but I wonder if iter_archive could be used?

Any suggestion is welcome.

Good catch, but we are already using this, for example.

I was testing earlier whether I could extract an archive and immediately delete it, but the _generate_examples function iterates through the archive’s contents, so this doesn’t work with our current implementation.
iter_archive seems to be the reason this doesn’t work. With a clever rewrite this approach could possibly work as well, but it seems fairly far-fetched and not aligned with other builder scripts in the audio domain.


I see. Then in this case you would be hard pressed to do without another library to manipulate tar.gz, but maybe you don’t want to add more dependencies…
If you’re going to do it with just the standard Python library, datasets, and soundfile, I do think it’s going to be hacky…
And you probably don’t want to prepare the dataset itself beforehand and put it on HF, either.

Hi ! If you load_dataset() an AudioFolder-formatted dataset, it won’t double the storage (the Arrow table of the dataset will just contain links to the audio files on your disk).
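
For reference, the AudioFolder route would look roughly like this (the directory path is a placeholder):

```python
from datasets import load_dataset

# Point load_dataset at a local folder of already-extracted audio files.
dataset = load_dataset("audiofolder", data_dir="/path/to/extracted_audio")

# The Arrow table stores references to the files; decoding happens on access.
print(dataset["train"][0]["audio"]["sampling_rate"])
```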


Hi! @lhoestq

I believe this is what I am currently doing in the dataset builder here.
Instead of loading the audio with audio.read(), which loads the bytes, I’m just passing the filepath string to the datasets.Audio() column.
My current problem is that we uploaded the audio files as tar.gz archives, which have to be extracted before a filepath string can be passed. Hence we extract the archive files, and we end up storing both the downloaded tar.gz archives and the extracted audio, which requires double the amount of storage.
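
In code, the current approach is roughly this (the file extension and directory layout are illustrative):

```python
from pathlib import Path

def _generate_examples(self, extracted_dir):
    # Yield only the filepath string; the Audio() column keeps a reference
    # to the file on disk instead of embedding the decoded bytes.
    for key, filepath in enumerate(sorted(Path(extracted_dir).glob("**/*.ogg"))):
        yield key, {"audio": str(filepath)}
```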

Currently I am looking at how minds14 handles its extraction and _generate_examples function. There it should be possible to sequentially extract the tar.gz files and delete each one right after extraction, as in the sketch below.
Would this be a good alternative? Or is there an easier way using AudioFolder?
(Note: we have a large number of individual audio files.)
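
A rough sketch of that per-archive extract-and-delete idea (not the actual minds14 code; paths and naming are illustrative):

```python
import os
import tarfile

def extract_and_cleanup(archive_paths, target_dir):
    extracted_dirs = []
    for archive_path in archive_paths:
        out_dir = os.path.join(
            target_dir, os.path.basename(archive_path).replace(".tar.gz", "")
        )
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(out_dir)  # extract this archive only
        os.remove(archive_path)      # free the archive's disk space right away
        extracted_dirs.append(out_dir)
    # at any point, at most one archive sits next to the extracted audio
    return extracted_dirs
```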


I’d recommend trying AudioFolder or streaming WebDataset, which are already well optimized and don’t duplicate the data locally.
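
For example, if the audio were repacked into .tar shards, streaming WebDataset could look like this (the shard pattern is a placeholder):

```python
from datasets import load_dataset

dataset = load_dataset(
    "webdataset",
    data_files={"train": "data/train-*.tar"},  # hypothetical local shard pattern
    split="train",
    streaming=True,  # samples are read directly from the shards, nothing is extracted
)

for example in dataset.take(2):
    print(example.keys())
```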
