I save a dataset to disk. Later, I load it from disk in a different script, add some items, and try to save it to disk again. I get the error "PermissionError: Tried to overwrite [path]/dataset.arrow but a dataset can't overwrite itself." How do I solve this? What is the correct way to add new rows to a dataset on disk?
Hi! Our library relies on PyArrow as a storage backend. PyArrow tables are immutable, so adding new rows means creating a new table, and since save_to_disk saves the entire table, saving it over and over could occupy a lot of disk space. A better approach here is to save each sub-dataset (new items = one sub-dataset) individually, then load and concatenate the sub-datasets (with concatenate_datasets) later.
Regarding the PermissionError: this happens when you try to save a dataset to a location already occupied by some of its sub-tables (because PyArrow tables are immutable, new rows are kept in memory as sub-tables until they are saved to disk). You can easily avoid this by saving the dataset to some other location.
PS: In the future, we could optimize appending and saving new rows by storing these rows in a separate file, which we would then put inside the existing save directory.
Oh ok, makes sense. I think in my use case I may as well just use CSV files, since realistically there isn't going to be that much data (maybe 10 columns of text, 100k rows max).
This feels embarrassingly stupid, but what is the standard way to replace a saved-to-disk dataset with a modified/updated one? The obvious approach would seem to be: save the modified dataset under a temp name, delete the original, then rename the temp to the original. But Python seems to keep hold of the files in the original dataset, and I can't work out how to release them so that I can delete the dataset directory.