Optimizing Disk Usage for Large (Audio) Datasets

mwirth7 · December 2, 2024, 12:09pm

I believe this is what I am currently doing in the dataset builder here.
Instead of loading the audio with audio.read(), which loads the bytes, I'm just passing the file path string to the datasets.Audio() column.
My current problem is that we uploaded the audio files in tar.gz format, and an archive has to be extracted before I can pass a file path string. So we extract the archive, which means we keep both the downloaded tar.gz and the extracted audio files and need double the amount of storage.
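
To make the path-only approach concrete, here is a minimal sketch of a generator that yields file path strings instead of audio bytes. It is an illustration only, not the actual builder script: the function name `generate_examples`, the `"audio"` column name, and the extension list are all assumptions.

```python
import os
from typing import Dict, Iterator, Tuple

# Assumed set of audio extensions; adjust to whatever the dataset contains.
AUDIO_EXTENSIONS = (".wav", ".flac", ".mp3", ".ogg")


def generate_examples(extracted_dir: str) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (key, example) pairs that hold only the path of each audio file.

    No audio bytes are read here; the path string is what would be handed
    to an Audio-typed column, so decoding can happen later on access.
    """
    key = 0
    for root, dirs, files in os.walk(extracted_dir):
        dirs.sort()  # make the traversal order deterministic
        for name in sorted(files):
            if name.lower().endswith(AUDIO_EXTENSIONS):
                yield key, {"audio": os.path.join(root, name)}
                key += 1
```

This only works on files that already exist on disk, which is why the archive has to be extracted first.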

Currently I am looking at how minds14 handles its extraction and its _generate_examples function. It should be possible to extract the tar.gz files sequentially and delete each one right after extraction.
Would this be a good alternative? Or is there an easier way using AudioFolder?
(Note: We have a large number of individual audio files.)
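
For reference, the extract-then-delete idea could look roughly like the sketch below. It uses only the Python standard library rather than the `datasets` download manager, and the helper name `extract_and_delete` and the one-folder-per-archive layout are assumptions made for illustration. It also assumes the archives are trusted, since `extractall` writes whatever paths the archive contains.

```python
import tarfile
from pathlib import Path
from typing import Iterable, List


def extract_and_delete(archives: Iterable[Path], out_dir: Path) -> List[Path]:
    """Extract each .tar.gz into its own folder, then delete the archive.

    Each archive is removed as soon as its contents are on disk, so an
    archive and its extracted copy only coexist for one archive at a time.
    """
    out_dir = Path(out_dir)
    targets = []
    for archive in archives:
        archive = Path(archive)
        name = archive.name
        # Strip the ".tar.gz" suffix to name the target folder.
        stem = name[: -len(".tar.gz")] if name.endswith(".tar.gz") else archive.stem
        target = out_dir / stem
        target.mkdir(parents=True, exist_ok=True)
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target)  # assumes a trusted archive
        archive.unlink()  # free the space used by the compressed copy
        targets.append(target)
    return targets
```

The returned folders can then be walked to collect file paths. If the archive is deleted only after a successful extraction, as here, a failure midway leaves the compressed copy in place for a retry.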
