[solved] How to load multiple arrow files into one dataset

aflip · August 3, 2023, 9:29am

Hello

I am trying to learn how to do embeddings etc. In an earlier run, I saved some embeddings into arrow files using dataset.save_to_disk(), which generated two files: data-00000-of-00002.arrow and data-00001-of-00002.arrow

Today, I want to load these two files into one dataset.

What I've tried: Dataset.from_file("/content/drive/MyDrive/data-00000-of-00002.arrow", "/content/drive/MyDrive/data-00001-of-00002.arrow"), that is, passing the two files to from_file

but this gives me an error AttributeError: 'str' object has no attribute 'copy'

So how do I load these two into one dataset?

Expected outcome: the original dataset that I had created and saved to disk is re-created from these two files.

Thank you

Update: Unsure if this is a workaround or something, but I used from datasets import load_dataset and then data = load_dataset("arrow", data_files={file1, file2, file3}) instead of Dataset.from_file, and this works. I am kinda glad my question went into the Akismet queue, because it made me want to try alternatives. If someone could tell me how to use the Dataset.from_file function to do the same, I would love that.
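
A small sketch of the load_dataset route from the update, in case it helps others. It assumes the shards sit in one folder and follow the data-*.arrow naming shown above; the helper names are made up for illustration. Passing a sorted list instead of a set should keep the shard order predictable, and split="train" is there because load_dataset otherwise hands back a DatasetDict rather than a single Dataset.

```python
from pathlib import Path


def shard_paths(directory, pattern="data-*.arrow"):
    """Return matching shard files as strings, sorted by file name."""
    return sorted(str(p) for p in Path(directory).glob(pattern))


def load_with_load_dataset(directory):
    # Imported lazily so shard_paths() works without `datasets` installed.
    from datasets import load_dataset

    # Without split="train", load_dataset returns a DatasetDict keyed by split.
    return load_dataset("arrow", data_files=shard_paths(directory), split="train")
```

Usage would be something like data = load_with_load_dataset("/content/drive/MyDrive").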

Cheers

TurboPascal · August 25, 2023, 5:03am

I have the same problem: loading multiple arrow files into one dataset, similar to the result after map(). I want to process the arrow files I need in advance.

aflip · August 25, 2023, 8:45am

Hi Enze, what do you mean by processing the files in advance? Are you able to load the files into a dataset? Please add some details on what you’re trying to do.

mariosasko · August 28, 2023, 6:22pm

You can load multiple Arrow files using Dataset.from_file (assuming their schemas match, so no additional casts are required) as follows:

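from datasets import concatenate_datasets, Dataset

# Arrow shards to combine, e.g. the two files from the question
arrow_files = ["/content/drive/MyDrive/data-00000-of-00002.arrow", "/content/drive/MyDrive/data-00001-of-00002.arrow"]

ds = concatenate_datasets([Dataset.from_file(arrow_file) for arrow_file in arrow_files])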
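
On the original error: as far as I can tell, Dataset.from_file takes a single filename, and a second positional argument is treated as the info parameter, which would explain the 'str' object has no attribute 'copy' message. Below is a rough sketch that builds on the concatenation approach and adds a sanity check on the shard names first. It assumes the data-XXXXX-of-XXXXX.arrow naming from the question; ordered_shards and concatenate_shards are hypothetical helper names, not part of the library.

```python
import re

# Matches names like data-00000-of-00002.arrow (the pattern seen in the question).
SHARD_RE = re.compile(r"data-(\d+)-of-(\d+)\.arrow$")


def ordered_shards(paths):
    """Sort shard paths by index and check that none are missing."""
    indexed = []
    totals = set()
    for path in paths:
        match = SHARD_RE.search(str(path))
        if match is None:
            raise ValueError(f"not a shard file name: {path}")
        indexed.append((int(match.group(1)), str(path)))
        totals.add(int(match.group(2)))
    if len(totals) != 1:
        raise ValueError("shards disagree on the total shard count")
    total = totals.pop()
    indexed.sort()
    if [index for index, _ in indexed] != list(range(total)):
        raise ValueError(f"expected shard indices 0..{total - 1}")
    return [path for _, path in indexed]


def concatenate_shards(paths):
    # Imported lazily so ordered_shards() works without `datasets` installed.
    from datasets import Dataset, concatenate_datasets

    # One from_file call per shard; schemas are assumed to match.
    return concatenate_datasets([Dataset.from_file(p) for p in ordered_shards(paths)])
```

The check is optional, but it catches a forgotten shard before the rows are silently concatenated without it.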

TurboPascal · September 16, 2023, 7:01pm

For example, processing a dataset into multiple arrow files and then loading them with the load_from_disk method.
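
If the shards were written by save_to_disk, as in the original question, load_from_disk is normally pointed at the directory that was saved, not at the individual .arrow files. A rough sketch, assuming the whole folder is still available and holds a single Dataset (save_to_disk writes state.json and dataset_info.json next to the shards); both helper names are made up for illustration.

```python
import os


def is_saved_dataset_dir(path):
    """Heuristic check: Dataset.save_to_disk writes state.json next to the shards."""
    return os.path.isfile(os.path.join(path, "state.json"))


def reload_saved_dataset(path):
    # Imported lazily so the check above works without `datasets` installed.
    from datasets import load_from_disk

    if not is_saved_dataset_dir(path):
        raise FileNotFoundError(f"no state.json in {path}; was this folder written by save_to_disk?")
    # load_from_disk takes the directory and picks up every shard listed in it.
    return load_from_disk(path)
```

If only the .arrow files survived and the JSON files are gone, this will not work, and loading the shards one by one is the fallback.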
