How do I add things (rows) to an already saved dataset?

I saved a dataset to disk. Later, I load it from disk in a different script, add some items, and try to save it to disk again. I get the error “PermissionError: Tried to overwrite [path]/dataset.arrow but a dataset can’t overwrite itself.” How do I solve this? What is the correct way to add new rows to the dataset on disk?

Hi! Our library relies on PyArrow as a storage backend. PyArrow tables are immutable, so adding new rows means creating a new table, and since save_to_disk saves the entire table, saving the growing dataset over and over could occupy a lot of disk space. A more efficient approach is to save each batch of new items as its own sub-dataset, then load the sub-datasets and combine them with concatenate_datasets when you need the full dataset.

Regarding the PermissionError: this happens when you try to save a dataset to a location already used by some of its sub-tables (because PyArrow tables are immutable, new rows are kept in memory as sub-tables until they are saved to disk). You can avoid this by saving the dataset to a different location.

PS: In the future, we could optimize appending and saving new rows by storing these rows in a separate file, which we would then put inside the existing save directory.

Oh OK, makes sense. I think in my use case I may as well just use CSV files, since there realistically won't be that much data (maybe 10 columns of text, 100k rows at most).
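For data of that size, appending to a CSV file could look roughly like the sketch below, which uses only Python's standard csv module. The column names in FIELDS and the helpers append_rows and read_rows are hypothetical; adjust them to the real schema.

```python
import csv
import tempfile
from pathlib import Path

FIELDS = ["id", "text"]  # hypothetical column names


def append_rows(path: Path, rows: list) -> None:
    """Append rows (a list of dicts) to a CSV file, writing the header only once."""
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerows(rows)


def read_rows(path: Path) -> list:
    """Read the whole CSV file back as a list of dicts (all values are strings)."""
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Temporary file used only so the example is self-contained.
csv_path = Path(tempfile.mkdtemp()) / "data.csv"

append_rows(csv_path, [{"id": 1, "text": "first"}])
append_rows(csv_path, [{"id": 2, "text": "second, with a comma"}])

rows = read_rows(csv_path)
print(len(rows))
```

Opening the file in append mode means earlier rows are never rewritten, and the csv module takes care of quoting text that contains commas or newlines.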