Caching only one feature, from a read-only dataset

Hey,

I want to add a feature to a large audio dataset before my training starts. In particular, it’s the length in seconds such that my HF trainer can “group_by_length” my inputs.
My datasets are all saved locally in a read-only folder (they were saved through save_to_disk()).

What’s happening now is that:

  • when I load the dataset with load_from_disk(), that folder is used as the cache by default, so any map/filter call fails because I don't have write access to it (e.g., this issue)
  • If I pass a cache_file_name pointing to a path where I do have write access, the cache files I'm creating are too big, since the whole dataset gets cached there (I don't have enough disk space for that)
  • If I remove all the original columns through remove_columns= and point the cache to a write-access path, the cache file correctly contains only the feature I'm generating (length, in this case). However, when I add it back to the dataset through add_column, the method internally calls flatten_indices(), which again requires write access to the dataset dir and crashes my script.

Any ideas?

Other constraints that I have are:

  • I cannot keep the dataset in memory
  • I cannot compute the lengths on the go since I need them for the length grouping sampler
  • I cannot afford to recompute each sample's length every time I run the script, since it takes too long
  • I would like to stay within the datasets framework since my codebase uses it in several places

I’m sorry, is this response AI-generated?
If possible, I would try to keep the conversation between humans (and the proposed approach does not address any of my issues 🙂)

Hi ! Maybe you can keep only the lengths in memory, and then concatenate them back to the memory-mapped (i.e. loaded from disk) dataset containing the audio ?

from datasets import concatenate_datasets

# Compute only the new column, keeping it in memory so no cache
# file is written next to the read-only dataset.
lengths_ds = ds.map(
    compute_length,
    remove_columns=ds.column_names,
    keep_in_memory=True,
)
# Column-wise concatenation; no cache file is written either.
ds = concatenate_datasets([ds, lengths_ds], axis=1)
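compute_length isn't shown above; a minimal sketch, assuming the audio column is decoded to a dict with array and sampling_rate (as with datasets' Audio feature) and the new column is called "length":

```python
def compute_length(example):
    # Decoded audio example: {"array": <samples>, "sampling_rate": <int>, ...}
    audio = example["audio"]
    # Duration in seconds = number of samples / sampling rate
    return {"length": len(audio["array"]) / audio["sampling_rate"]}
```

The resulting "length" column is then available for the length-grouping sampler.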

Thanks! So, I guess the concatenate_datasets does not use any caching, right?

yes correct !
