Load dataset from a specific cache file

Dear community,

Does anyone know if there is a way to load a dataset from a specific cache file? I applied very costly transformations with a map function to do some data augmentation, and that produced my final dataset. Now, in my main script, I would like to use the cache file built by that other script. Is there any way to load a specific cache file from ~/.cache/huggingface/datasets/?

You can reload any arrow file from the cache with

from datasets import Dataset 

ds = Dataset.from_file("path/to/data.arrow")

Does from_file try to read the whole dataset?

I am in a similar situation: I spent days adding columns to a dataset using .map(), but the execution failed when I called save_to_disk, so I am hoping to recover the result from the cache.
The map operation itself finished successfully and made the dataset huge (1.3TB).

Originally I planned to load the dataset as an iterable so that each batch only uses as many rows as will fit in memory. I am currently trying Dataset.from_file("the larger arrow file"), but it is taking a long time, so I am wondering whether the file is still usable. I am also not sure which arrow file I should be reading; I am hoping that the one with the cache- prefix contains all the data and not just part of it.

The "cache-9aaxxxxx" file should indeed be the one :)
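If it helps to see which cache files exist before picking one, here is a minimal sketch that lists them by size using only the standard library. It assumes the files written by map are named cache-<something>.arrow (as in the "cache-9aaxxxxx" example above) and sit somewhere under the cache directory. find_map_cache_files is a hypothetical helper name, not part of the datasets library.

```python
from pathlib import Path


def find_map_cache_files(cache_root):
    """Return (path, size_in_bytes) pairs for cache-*.arrow files, largest first.

    Assumption: map outputs follow the "cache-<fingerprint>.arrow" naming
    pattern; adjust the glob if your files are named differently.
    """
    root = Path(cache_root).expanduser()
    if not root.is_dir():
        return []
    files = [
        (path, path.stat().st_size)
        for path in root.rglob("cache-*.arrow")
        if path.is_file()
    ]
    # Largest file first: a map that added columns usually produces the biggest file.
    return sorted(files, key=lambda item: item[1], reverse=True)


# Example usage (the path is the default cache location mentioned in the question):
# for path, size in find_map_cache_files("~/.cache/huggingface/datasets"):
#     print(f"{size / 1e9:10.2f} GB  {path}")
```

If map was run with several processes, there may be several cache- files that each hold a shard of the result, so it is worth checking whether more than one file shares the same prefix.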

Dataset.from_file should work. What takes time is reading the metadata of all the record batches (the chunks that make up an arrow file). It doesn't load the actual dataset content into memory.

Alternatively, you can use IterableDataset.from_file, which doesn't read the metadata, but we haven't implemented save_to_disk for IterableDataset.