Hey,
I want to add a feature to a large audio dataset before my training starts. In particular, it's the length in seconds of each sample, so that my HF Trainer can `group_by_length` my inputs.
My datasets are all saved locally in a read-only folder (they were saved through `save_to_disk()`).
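For context, the computation itself is just a `map` over the audio column, roughly like this (the `audio` column name and the decoded `array`/`sampling_rate` fields are just how my data looks, the exact function doesn't matter much):

```python
def add_length(batch):
    # Duration in seconds = number of samples / sampling rate
    batch["length"] = [
        len(a["array"]) / a["sampling_rate"] for a in batch["audio"]
    ]
    return batch

ds = ds.map(add_length, batched=True)  # this is what fails, see below
```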
What’s happening now is that:
- When I load the dataset with `load_from_disk()`, that folder is used as the cache by default, so any map/filter call fails since I don't have write access to it (e.g., this issue)
- If I pass a `cache_file_name` pointing to a path where I do have write access, the cache files I'm creating are too big, since the whole dataset is cached there (I don't have enough disk space for that)
- If I remove all the original columns through `remove_columns=` and specify a write-access path, the cache file correctly contains only the feature I'm generating (`length` in this case). However, when I add it back to the dataset through `add_column()`, the method internally calls `flatten_indices()`, which again requires write access to the dataset dir and crashes my script (rough sketch below)
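Here is roughly what that third attempt looks like (paths are placeholders, `add_length` is the map function from above):

```python
from datasets import load_from_disk

ds = load_from_disk("/read_only/my_dataset")  # read-only folder

# Cache only the new column in a location where I do have write access
lengths = ds.map(
    add_length,
    batched=True,
    remove_columns=ds.column_names,
    cache_file_name="/writable/lengths.arrow",
)

# This is where it crashes: add_column() internally calls flatten_indices(),
# which tries to write into the original (read-only) dataset directory
ds = ds.add_column("length", lengths["length"])
```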
Any ideas?
Other constraints that I have are:
- I cannot keep the dataset in memory
- I cannot compute the lengths on the fly since I need them upfront for the length-grouping sampler
- I cannot afford to recompute each sample's length every time I run the script since it takes too long
- I would like to stay within the `datasets` framework since my codebase uses it in several places
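For completeness, this is how I plan to consume the column on the Trainer side (assuming the column is named `length`, which matches the default `length_column_name`):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    group_by_length=True,         # use the length-grouped sampler
    length_column_name="length",  # column holding the precomputed lengths
)
```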