[urgent] Can you reconstruct datasets using the cache file (.arrow file)?

zeyuyun1 · January 2, 2021, 11:01pm

Hi,

Basically, I constructed a dataset with the map function in a Jupyter notebook, and I can find its cache file (.arrow). Is there a way to reconstruct the dataset from that file, similar to “datasets.load_from_disk()”?

lhoestq · January 3, 2021, 2:16pm

Sure! If you only have the .arrow file, you can do:

from datasets import Dataset

# path_to_arrow_file is the path to the cached .arrow file
dataset = Dataset.from_file(path_to_arrow_file)

zeyuyun1 · January 4, 2021, 2:09am

It works! thank you so much!

ezio98 · August 24, 2021, 12:58pm

I think your answer works well when there is a single arrow file. But when I process the data using multiple processes, the dataset is stored in several arrow files. Can I simply load each of them and concatenate them?

lhoestq · August 24, 2021, 1:09pm

Yes, you can load them and concatenate them.
That’s exactly what map does under the hood.

ezio98 · August 27, 2021, 7:51am

Thanks, I get it.
