Caching only one feature, from a read-only dataset

Hey,

I want to add a feature to a large audio dataset before my training starts. In particular, it’s the length in seconds such that my HF trainer can “group_by_length” my inputs.
My datasets are all saved locally in a read-only folder (they were saved through save_to_disk()).

What’s happening now is that:

  • when I load the dataset with load_from_disk(), that folder is used as the cache by default, so any map/filter call fails because I don't have write access to it (e.g., this issue)
  • If I pass a cache_file_name pointing to a path where I do have write access, the cache files I'm creating are too big, since the whole dataset gets cached there (I don't have enough disk space for that)
  • If I remove all the original columns through remove_columns= and point the cache to a write-access path, the cache file correctly contains only the feature I'm generating (length, in this case). However, when I add it back to the dataset through add_column, the method internally calls flatten_indices(), which again requires write access to the dataset dir and crashes my script.

Any ideas?

Other constraints that I have are:

  • I cannot keep the dataset in memory
  • I cannot compute the lengths on the go since I need them for the length grouping sampler
  • I cannot afford to recompute each sample's length every time I run the script, since it takes too long
  • I would like to stay within the datasets framework since my codebase uses it in several places

I’m sorry, is this response AI-generated?
If possible, I would try to keep the conversation between humans (and the proposed approach does not address any of my issues 🙂)

Hi ! Maybe you can keep only the lengths in memory, and then concatenate them back to the memory-mapped (i.e. loaded from disk) dataset containing the audio ?

from datasets import concatenate_datasets

# Compute only the new column, keeping it in memory so no cache
# file is written next to the read-only dataset.
lengths_ds = ds.map(
    compute_length,
    remove_columns=ds.column_names,
    keep_in_memory=True,
)
# Column-wise concatenation; no cache file is written either.
ds = concatenate_datasets([ds, lengths_ds], axis=1)
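compute_length isn't shown above; a minimal sketch, assuming the audio column is decoded to a dict with array and sampling_rate (as with datasets' Audio feature) and the new column is called "length":

```python
def compute_length(example):
    # Decoded audio example: {"array": <samples>, "sampling_rate": <int>, ...}
    audio = example["audio"]
    # Duration in seconds = number of samples / sampling rate
    return {"length": len(audio["array"]) / audio["sampling_rate"]}
```

The resulting "length" column is then available for the length-grouping sampler.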

Thanks! So, I guess the concatenate_datasets does not use any caching, right?

yes correct !
