Load dataset from a specific cache file

mdelas · October 26, 2022, 8:13am

Dear community,

Does anyone know if there is a way to load a dataset from a specific cache file? I applied very costly transformations with a map function to do some data augmentation, and that produced my final dataset. Now, in my main script, I would like to use the cache file built by that other script. Is there any way to load a specific cache file from ~/.cache/huggingface/datasets/ ?

lhoestq · October 26, 2022, 11:12am

You can reload any arrow file from the cache with

from datasets import Dataset 

ds = Dataset.from_file("path/to/data.arrow")
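To find which file to pass to Dataset.from_file, one option is to scan the cache directory for files with the cache- prefix. The sketch below is only an illustration: list_cache_files and load_cached are hypothetical helper names, not part of the datasets library, and it assumes the cache lives under ~/.cache/huggingface/datasets/ as in the question. If map was run with several processes there may be more than one cache- file for the same result.

```python
from pathlib import Path


def list_cache_files(cache_root):
    """Return cache-*.arrow files found under cache_root, largest first.

    Hypothetical helper: it only inspects file names and sizes.
    """
    root = Path(cache_root).expanduser()
    if not root.is_dir():
        return []
    files = [p for p in root.rglob("cache-*.arrow") if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_size, reverse=True)


def load_cached(path):
    """Open one cache file with Dataset.from_file (needs `datasets` installed)."""
    # Imported lazily so list_cache_files works without the library.
    from datasets import Dataset

    return Dataset.from_file(str(path))


if __name__ == "__main__":
    # Print the five largest candidates; the map output is usually among them.
    for p in list_cache_files("~/.cache/huggingface/datasets")[:5]:
        print(p.stat().st_size, p)
```

In the script that ran map, printing ds.cache_files on the resulting dataset also shows the exact file names, which avoids guessing.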

jahb57 · February 26, 2024, 3:48pm

Does from_file try to read the whole dataset?

I am in a similar situation: I spent days adding columns to a dataset using .map(), but the execution failed when I tried to save_to_disk, so I am hoping to recover the result from the cache.
The map operation finished successfully, making the dataset huge (1.3TB).

Originally I planned to load the dataset as an iterable so that each batch only uses as many rows as fit in memory. I am currently trying Dataset.from_file(“the larger arrow file”), but it is taking a long time, so I am wondering whether this is still usable. I am also not sure which arrow file I should be reading; I am hoping that the one with the cache- prefix contains all the data and not just part of it.

lhoestq · February 26, 2024, 6:15pm

The “cache-9aaxxxxx” file should indeed be the one.

Dataset.from_file should work. What takes time is reading the metadata of all the record batches (the chunks that make up an Arrow file); it doesn’t load the actual dataset content into memory.

Alternatively, you can use IterableDataset.from_file, which doesn’t read the metadata, but we haven’t implemented save_to_disk for IterableDataset.
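For the streaming route, a rough sketch of reading a cache file in fixed-size batches could look like the following. The names open_streaming and batches are hypothetical helpers introduced here for illustration, and the batching is done in plain Python so that it works on any iterable of rows; it assumes a datasets version that provides IterableDataset.from_file.

```python
from itertools import islice


def open_streaming(path):
    """Open an arrow file lazily (needs `datasets` installed)."""
    # Imported lazily so the batching helper below works on its own.
    from datasets import IterableDataset

    return IterableDataset.from_file(path)


def batches(rows, batch_size):
    """Yield lists of at most batch_size rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk
```

Usage would be along the lines of `for batch in batches(open_streaming("path/to/data.arrow"), 1000): ...`, so only one batch of rows is held in memory at a time.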
