I save a dataset to disk. Later, I load it from disk in a different script. I add some items. I try to save to disk again. I get the error “PermissionError: Tried to overwrite [path]/dataset.arrow but a dataset can’t overwrite itself.” How do I solve this? What is the correct way to add new rows to the dataset on disk?
Hi! Our library relies on PyArrow as a storage backend. PyArrow tables are immutable, so adding new rows means creating a new table, and since save_to_disk saves the entire table, saving it over and over could occupy a lot of disk space. A better approach here is to save each sub-dataset (new items = one sub-dataset) individually and later load + concatenate the sub-datasets with concatenate_datasets.
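For example, a minimal sketch of that pattern (paths and data below are just placeholders):

```python
from datasets import Dataset, load_from_disk, concatenate_datasets

# first batch of items, written once (paths and data are placeholders)
Dataset.from_dict({"text": ["a", "b"]}).save_to_disk("data/part-000")

# later, in another script: save the new items as their own sub-dataset
Dataset.from_dict({"text": ["c", "d"]}).save_to_disk("data/part-001")

# whenever the full dataset is needed, load the parts and concatenate them
parts = [load_from_disk(path) for path in ("data/part-000", "data/part-001")]
full = concatenate_datasets(parts)
```

Each part is written once and never touched again, so nothing ever tries to overwrite its own arrow files.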
Regarding the PermissionError: this happens when you try to save a dataset to a location already used by some of its sub-tables (because PyArrow tables are immutable, new rows are kept in memory as sub-tables before being saved to disk). You can easily avoid it by saving the dataset to some other location.
PS: In the future, we could optimize appending and saving new rows by storing these rows in a separate file, which we would then put inside the existing save directory.
Oh OK, makes sense. I think in my use case I may as well just use CSV files, since realistically there isn’t going to be that much data (maybe 10 columns of text, 100k rows max).
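e.g. something like this, staying in the datasets API but keeping the data in a plain CSV (file name is just a placeholder):

```python
from datasets import Dataset, load_dataset

# write the data out as a plain CSV
ds = Dataset.from_dict({"text": ["a", "b"]})
ds.to_csv("my_data.csv")

# later: reload, add a row, and simply overwrite the CSV
ds = load_dataset("csv", data_files="my_data.csv", split="train")
ds = ds.add_item({"text": "c"})
ds.to_csv("my_data.csv")
```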
This feels embarrassingly stupid, but what is the standard way to replace a dataset saved to disk with a modified/updated version? The obvious way would seem to be to save the modified dataset under a temp name, delete the original, and then rename the temp to the original. But Python seems to keep hold of the files in the original dataset, and I can’t work out how to release them so that I can delete the dataset directory.
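i.e. something like this (paths are just placeholders):

```python
import shutil
from datasets import load_from_disk

ds = load_from_disk("my_dataset")              # memory-maps the arrow files on disk
ds = ds.add_column("new_col", [0] * len(ds))   # some modification
ds.save_to_disk("my_dataset_tmp")              # save the updated dataset under a temp name

# this is the part that fails for me: the original arrow files still seem
# to be held open by the dataset object, so the directory can't be deleted
shutil.rmtree("my_dataset")
shutil.move("my_dataset_tmp", "my_dataset")
```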
I would also like to know this.
Bump
I guess I’ll just create a bunch of datasets for now.
My use case is to add a column when I run an experiment on a new model.
Interesting. Which format do you use to store your data?
I just store the LLM next-token prediction as an extra label.
OK, but which file format? Do you have an example dataset to share?
I just use hf datasets save_to_disk.
My dataset is some texts from fineweb, and for each model I add a last-token-id column and a last-token-predicted column. E.g.
text="I| love| Ber|kley", gt_gpt2={kley token id}, pred_gpt2={lin token id}
(using | to show how the model would tokenize the string)
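So per model it’s roughly this (toy data, and the token ids are made up just to show the shape):

```python
from datasets import Dataset

# toy stand-in for the fineweb text subset
ds = Dataset.from_dict({"text": ["I love Berkley", "Hello world"]})

# results computed elsewhere for the new model (gpt2 here); ids are made up
gt_gpt2 = [74, 995]      # true last-token ids
pred_gpt2 = [2815, 995]  # ids the model actually predicted

ds = ds.add_column("gt_gpt2", gt_gpt2)
ds = ds.add_column("pred_gpt2", pred_gpt2)

# saved under a new path for each round of experiments rather than
# on top of the original save_to_disk copy
ds.save_to_disk("fineweb_subset_gpt2")
```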