[solved] How to load multiple arrow files into one dataset

Hello

I am trying to learn how to do embeddings. In an earlier run, I saved some embeddings to Arrow files using dataset.save_to_disk(), which generated two files: data-00000-of-00002.arrow and data-00001-of-00002.arrow

Today, I want to load these two files into one dataset.

What I've tried: Dataset.from_file("/content/drive/MyDrive/data-00000-of-00002.arrow", "/content/drive/MyDrive/data-00001-of-00002.arrow"), i.e., passing both files to from_file.

But this gives me an error: AttributeError: 'str' object has no attribute 'copy'

So how do I load these two files into one dataset?

Expected outcome: the original dataset that I created and saved to disk is re-created from these two files.

thank you

Update: Unsure if this counts as a workaround, but I used from datasets import load_dataset and then data = load_dataset("arrow", data_files=[file1, file2, file3]) instead of Dataset.from_file, and it works. I'm kinda glad my question went into the Akismet queue, because it made me want to try alternatives :slight_smile: If someone could tell me how to use Dataset.from_file to do the same, I would love that.

Cheers

I have the same problem: loading multiple Arrow files into one dataset, similar to the result after map(). I want to process the Arrow files I need in advance.

Hi Enze, what do you mean by processing the files in advance? Are you able to load the files into a dataset? Please add some details on what you're trying to do.

You can load multiple Arrow files using Dataset.from_file (assuming their schemas match, so no additional casts are required) as follows:

from datasets import concatenate_datasets, Dataset

# arrow_files is the list of shard paths, e.g. the two data-0000x-of-00002.arrow files
ds = concatenate_datasets([Dataset.from_file(arrow_file) for arrow_file in arrow_files])

For example, processing a dataset into multiple Arrow files in advance and then using the load_from_disk method.