I save a dataset to disk. Later, I load it from disk in a different script. I add some items. I try to save to disk again. I get the error “PermissionError: Tried to overwrite [path]/dataset.arrow but a dataset can’t overwrite itself.” How do I solve this? What is the correct way to add new rows to the dataset on disk?
Hi! Our library relies on PyArrow as a storage backend. PyArrow tables are immutable, so adding new rows means creating a new table, and since save_to_disk saves the entire table, saving it over and over could occupy a lot of disk space. A better approach here is to save each sub-dataset (new items = one sub-dataset) individually and later load + concatenate the sub-datasets with concatenate_datasets.
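For example, a minimal sketch of that pattern (paths and data below are just placeholders):

```python
from datasets import Dataset, load_from_disk, concatenate_datasets

# first batch of items, written once (paths and data are placeholders)
Dataset.from_dict({"text": ["a", "b"]}).save_to_disk("data/part-000")

# later, in another script: save the new items as their own sub-dataset
Dataset.from_dict({"text": ["c", "d"]}).save_to_disk("data/part-001")

# whenever the full dataset is needed, load the parts and concatenate them
parts = [load_from_disk(path) for path in ("data/part-000", "data/part-001")]
full = concatenate_datasets(parts)
```

Each part is written once and never touched again, so nothing ever tries to overwrite its own arrow files.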
Regarding the PermissionError: this happens when you try to save a dataset to a location already used by some of its sub-tables (because PyArrow tables are immutable, new rows are kept in memory as sub-tables before being saved to disk). You can easily avoid it by saving the dataset to some other location.
PS: In the future, we could optimize appending and saving new rows by storing these rows in a separate file, which we would then put inside the existing save directory.
Oh OK, makes sense. I think in my use case I may as well just use CSV files, since realistically there isn’t going to be that much data (maybe 10 columns of text, 100k rows max).
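e.g. something like this, staying in the datasets API but keeping the data in a plain CSV (file name is just a placeholder):

```python
from datasets import Dataset, load_dataset

# write the data out as a plain CSV
ds = Dataset.from_dict({"text": ["a", "b"]})
ds.to_csv("my_data.csv")

# later: reload, add a row, and simply overwrite the CSV
ds = load_dataset("csv", data_files="my_data.csv", split="train")
ds = ds.add_item({"text": "c"})
ds.to_csv("my_data.csv")
```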
This feels embarrassingly stupid, but what is the standard way to replace a dataset saved to disk with a modified/updated version? The obvious way would seem to be to save the modified dataset under a temp name, delete the original, and then rename the temp to the original. But Python seems to keep hold of the files in the original dataset, and I can’t work out how to release them so that I can delete the dataset directory.
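i.e. something like this (paths are just placeholders):

```python
import shutil
from datasets import load_from_disk

ds = load_from_disk("my_dataset")              # memory-maps the arrow files on disk
ds = ds.add_column("new_col", [0] * len(ds))   # some modification
ds.save_to_disk("my_dataset_tmp")              # save the updated dataset under a temp name

# this is the part that fails for me: the original arrow files still seem
# to be held open by the dataset object, so the directory can't be deleted
shutil.rmtree("my_dataset")
shutil.move("my_dataset_tmp", "my_dataset")
```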
I would also like to know this.
Bump
I guess I’ll just create a bunch of datasets for now.
My use case is to add a column when I run an experiment on a new model.
Interesting. Which format do you use to store your data?
I just store the LLM next-token prediction as an extra label.
OK, but which file format? Do you have an example dataset to share?
I just use hf datasets save_to_disk.
My dataset is some texts from fineweb, and for each model I add a last-token-id column and a last-token-predicted column. E.g.
text="I| love| Ber|kley", gt_gpt2={kley token id}, pred_gpt2={lin token id}
(using | to show how the model would tokenize the string)
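So per model it’s roughly this (toy data, and the token ids are made up just to show the shape):

```python
from datasets import Dataset

# toy stand-in for the fineweb text subset
ds = Dataset.from_dict({"text": ["I love Berkley", "Hello world"]})

# results computed elsewhere for the new model (gpt2 here); ids are made up
gt_gpt2 = [74, 995]      # true last-token ids
pred_gpt2 = [2815, 995]  # ids the model actually predicted

ds = ds.add_column("gt_gpt2", gt_gpt2)
ds = ds.add_column("pred_gpt2", pred_gpt2)

# saved under a new path for each round of experiments rather than
# on top of the original save_to_disk copy
ds.save_to_disk("fineweb_subset_gpt2")
```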