add_column() does not work on a dataset sliced with select()

ThomasG · January 19, 2022, 1:01pm

Hello all, say I have a dataset with 2000 entries

from datasets import Dataset
dataset = Dataset.from_dict({'colA': list(range(2000))})

from which I want to take the first 1000 rows as a new dataset and then add a new column to it:

dataset2 = dataset.select(list(range(1000)))
final_dataset = dataset2.add_column('colB', list(range(1000)))

This raises an error:

ArrowInvalid: Added column’s length must match table’s length. Expected length 2000 but got length 1000

I’ve experimented with the arguments of the select method, but I did not find a way to get around this error. Does anyone know why it’s happening and how to resolve it?

Thanks.

mariosasko · January 19, 2022, 1:15pm

Hi! Could you please open an issue in our GitHub repo? This looks like a bug in datasets.

In the meantime, call flatten_indices (dset.flatten_indices()) after select and before add_column.

ThomasG · January 19, 2022, 1:31pm

Will do. What you suggested works as well, thanks.
