Caching only one feature, from a read-only dataset

Hey,

I want to add a feature to a large audio dataset before my training starts. Specifically, it's each example's length in seconds, so that my HF Trainer can group_by_length my inputs.
My datasets are all saved locally in a read-only folder (they were saved through save_to_disk()).
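
Roughly, the setup looks like this (the path and the audio column layout are simplified placeholders, not my real schema):

```python
from datasets import load_from_disk

# Saved earlier with save_to_disk() into a folder I can only read
ds = load_from_disk("/readonly/datasets/my_audio")  # placeholder path

# The feature I want to precompute, assuming a decoded "audio" column
# with an "array" and a "sampling_rate" field:
def add_length(example):
    audio = example["audio"]
    example["length"] = len(audio["array"]) / audio["sampling_rate"]
    return example
```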

What’s happening now is that:

  • when I load the dataset with load_from_disk(), that folder is used as the cache location by default, so any map/filter call fails because I don't have write access to it (e.g., this issue)
  • If I pass a cache_file_name pointing to a path where I do have write access, the cache files are too big, since the whole dataset gets cached there (I don't have enough disk space for that)
  • If I remove all the original columns through remove_columns= and specify a write-access path, the cache file correctly contains only the feature I'm generating (length, in this case). However, when I add it back to the dataset through add_column(), the method internally calls flatten_indices(), which again requires write access to the dataset dir and crashes my script (see the sketch after this list)
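
The third attempt looks roughly like this (reusing add_length from the snippet above; /scratch stands in for a writable location):

```python
# Cache only the new column in a location where I can write
lengths = ds.map(
    add_length,
    remove_columns=ds.column_names,           # drop everything but "length"
    cache_file_name="/scratch/lengths.arrow",  # writable path, small file
)

# ...but adding it back crashes: add_column() internally calls
# flatten_indices(), which tries to write into the read-only dataset dir
ds = ds.add_column("length", lengths["length"])
```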

Any ideas?

Other constraints that I have are:

  • I cannot keep the dataset in memory
  • I cannot compute the lengths on the fly, since I need them up front for the length-grouping sampler
  • I cannot afford to compute each sample's length every time I run the script, since it takes too long
  • I would like to stay within the datasets framework since my codebase uses it in several places