Caching only one feature, from a read-only dataset

Hey,

I want to add a feature to a large audio dataset before my training starts. Specifically, it's each example's length in seconds, so that my HF Trainer can group_by_length my inputs.
My datasets are all saved locally in a read-only folder (they were saved through save_to_disk()).
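
Roughly, the setup looks like this (the path and the audio column layout are simplified placeholders, not my real schema):

```python
from datasets import load_from_disk

# Saved earlier with save_to_disk() into a folder I can only read
ds = load_from_disk("/readonly/datasets/my_audio")  # placeholder path

# The feature I want to precompute, assuming a decoded "audio" column
# with an "array" and a "sampling_rate" field:
def add_length(example):
    audio = example["audio"]
    example["length"] = len(audio["array"]) / audio["sampling_rate"]
    return example
```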

What’s happening now is that:

  • when I load the dataset with load_from_disk(), that folder is used as the cache location by default, so any map/filter call fails because I don't have write access to it (e.g., this issue)
  • If I pass a cache_file_name pointing to a path where I do have write access, the cache files are too big, since the whole dataset gets cached there (I don't have enough disk space for that)
  • If I remove all the original columns through remove_columns= and specify a write-access path, the cache file correctly contains only the feature I'm generating (length, in this case). However, when I add it back to the dataset through add_column(), the method internally calls flatten_indices(), which again requires write access to the dataset dir and crashes my script (see the sketch after this list)
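
The third attempt looks roughly like this (reusing add_length from the snippet above; /scratch stands in for a writable location):

```python
# Cache only the new column in a location where I can write
lengths = ds.map(
    add_length,
    remove_columns=ds.column_names,           # drop everything but "length"
    cache_file_name="/scratch/lengths.arrow",  # writable path, small file
)

# ...but adding it back crashes: add_column() internally calls
# flatten_indices(), which tries to write into the read-only dataset dir
ds = ds.add_column("length", lengths["length"])
```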

Any ideas?

Other constraints that I have are:

  • I cannot keep the dataset in memory
  • I cannot compute the lengths on the fly, since I need them up front for the length-grouping sampler
  • I cannot afford to compute each sample's length every time I run the script, since it takes too long
  • I would like to stay within the datasets framework since my codebase uses it in several places