I am working on the BirdSet dataset. We recently noticed that after the tar files are downloaded and extracted, the required disk space doubles, because both the archives and the extracted audio files remain in the cache folder.
So far I have tried extracting the files one by one with the provided download manager and deleting each archive right after it is extracted, which is the behaviour I am looking for. But when I do that, the ‘generate_example’ function fails because it tries to read the archive that was just deleted.
So far I have not found a solution that follows this behaviour and doesn’t break streaming.
My question: Is there a simple way to delete the archives as soon as they aren’t needed anymore (after the extraction)?
This way users would not need double the storage space of the original files.
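To make the desired behaviour concrete, the extract-then-delete step could be sketched as below. This is only an illustration under assumptions: it uses the Python standard library instead of the download manager, and `extract_and_delete` is a hypothetical helper name, not part of any library.

```python
import os
import tarfile
from typing import List


def extract_and_delete(archive_path: str, output_dir: str) -> List[str]:
    """Extract a tar archive into output_dir, then delete the archive.

    Hypothetical helper for illustration; returns the extracted file paths.
    """
    os.makedirs(output_dir, exist_ok=True)
    with tarfile.open(archive_path) as tar:
        # Remember the regular files so the caller can read them later.
        names = [m.name for m in tar.getmembers() if m.isfile()]
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions offer a safer extraction filter.
            tar.extractall(output_dir, filter="data")
        else:
            tar.extractall(output_dir)
    # The archive is removed only after extraction has finished.
    os.remove(archive_path)
    return [os.path.join(output_dir, name) for name in names]
```

With something like this, example generation would presumably have to read from the returned paths rather than from the archive path, which is the part that currently fails for me.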