Hi! @lhoestq
I believe this is what I am currently doing in the dataset builder here. Instead of loading the audio with audio.read(), which loads the bytes, I'm just passing the file path string to the datasets.Audio() column.
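To make that concrete, here is a minimal sketch of the path-only approach, written without the datasets dependency so it runs on its own. The function name generate_examples, the field names "path" and "audio", and the extension list are my assumptions for illustration; in a real builder this logic would sit in _generate_examples, and the "audio" field would be declared as datasets.Audio() in the features.

```python
import os

# Assumed set of audio extensions; adjust to whatever the dataset contains.
AUDIO_EXTENSIONS = (".wav", ".mp3", ".flac")


def generate_examples(audio_dir):
    """Yield (key, example) pairs that carry only the file path, never the bytes."""
    key = 0
    # Sort directories and files so the keys are stable across runs.
    for root, _dirs, files in sorted(os.walk(audio_dir)):
        for fname in sorted(files):
            if fname.lower().endswith(AUDIO_EXTENSIONS):
                path = os.path.join(root, fname)
                # The audio field is just the path string; no file is opened here.
                yield key, {"path": path, "audio": path}
                key += 1
```

Since no file is opened while generating examples, the audio only has to be decoded when a row is actually accessed.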
My current problem is that we uploaded the audio files as tar.gz archives, which have to be extracted before a file path string can be passed. So we extract each archive, which means we keep both the downloaded tar.gz and the extracted audio files on disk, thus needing double the amount of storage.
I am currently looking at how minds14 handles its extraction and its _generate_examples function. It should be possible to extract the tar.gz files one at a time and delete each one right after it has been extracted.
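A rough sketch of the extract-then-delete idea, using only the Python standard library. The helper name extract_and_delete and the one-subfolder-per-archive layout are assumptions for illustration, not something taken from minds14 or from the datasets API.

```python
import os
import tarfile


def extract_and_delete(archive_paths, output_dir):
    """Extract each .tar.gz into its own subfolder, then delete the archive."""
    os.makedirs(output_dir, exist_ok=True)
    extracted_dirs = []
    for archive_path in archive_paths:
        # Name the target folder after the archive, minus the .tar.gz suffix.
        name = os.path.basename(archive_path)
        if name.endswith(".tar.gz"):
            name = name[: -len(".tar.gz")]
        target_dir = os.path.join(output_dir, name)
        os.makedirs(target_dir, exist_ok=True)
        # Only use extractall on archives you trust (here: our own uploads).
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(target_dir)
        # Remove the archive only after extraction finished without an error.
        os.remove(archive_path)
        extracted_dirs.append(target_dir)
    return extracted_dirs
```

This only frees space as each archive is processed. If all archives are downloaded before the first extraction starts, peak disk usage is still the full download plus whatever has been extracted so far, so the download would also have to happen one archive at a time to get the full benefit.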
Would this be a good alternative? Or is there an easier way using AudioFolder?
(Note: We have a large number of individual audio files.)