[urgent] Can you reconstruct a dataset from its cache file (.arrow file)?

Hi,

Basically, I constructed a dataset with the map function in a Jupyter notebook, and I can find its cache file (.arrow). Is there a way to reconstruct the dataset from that file, similar to “datasets.load_from_disk()”?


Sure! If you only have the .arrow file, you can do:

from datasets import Dataset

# path_to_arrow_file is the path to the .arrow cache file you found
dataset = Dataset.from_file(path_to_arrow_file)

It works! Thank you so much!

Your answer works well when there is a single .arrow file. But when I process the data with multiple processes, the dataset is stored in several .arrow files. Can I simply load each of them and concatenate them?

Yes, you can load them and concatenate them.
That’s exactly what map does under the hood.

Thanks, I get it.