Hi,
Basically, I constructed a dataset with the map function in a Jupyter notebook, and I can find the cache file (.arrow). Is there a way to reconstruct the dataset from that file, similar to “datasets.load_from_disk()”?
Sure! If you only have the .arrow file, you can do:
from datasets import Dataset
dataset = Dataset.from_file(path_to_arrow_file)
It works! thank you so much!
I think your answer works well when there is a single Arrow file. But when I process the data with multiple processes, the dataset is stored in several Arrow files. Can I simply load each of them and concatenate them?
Yes, you can load them and concatenate them. That’s exactly what map does under the hood.
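For anyone landing here later, this is a rough sketch of the load-and-concatenate approach, not necessarily how map does it internally. It uses Dataset.from_file and concatenate_datasets from the datasets library. The helper names find_shards and load_shards, the cache directory, and the file name pattern are all made up for illustration, and it assumes the shard files sort into the correct order by name.

```python
import glob
import os


def find_shards(cache_dir, pattern):
    """Return the files in cache_dir matching pattern, sorted by name.

    Assumption: sorting by name gives the shard order you want.
    """
    return sorted(glob.glob(os.path.join(cache_dir, pattern)))


def load_shards(paths):
    """Load each Arrow file as a Dataset and concatenate them in order."""
    # Imported here so find_shards can be used without the library installed.
    from datasets import Dataset, concatenate_datasets

    if not paths:
        raise ValueError("no Arrow files to load")
    return concatenate_datasets([Dataset.from_file(p) for p in paths])


# Hypothetical usage; replace the directory and pattern with your own:
# paths = find_shards("path/to/cache_dir", "cache-*.arrow")
# dataset = load_shards(paths)
```

Check the matched paths before loading, since a loose pattern can also pick up cache files from other map calls in the same directory.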
Thanks, I get it.