add_column() does not work on a dataset sliced with select()

Hello all, say I have a dataset with 2000 entries

from datasets import Dataset
dataset = Dataset.from_dict({'colA': list(range(2000))})

and I want to extract the first one thousand rows into a new dataset, then add a new column to it:

dataset2 = dataset.select(list(range(1000)))
final_dataset = dataset2.add_column('colB', list(range(1000)))

This raises an error:

ArrowInvalid: Added column's length must match table's length. Expected length 2000 but got length 1000

I’ve experimented with the arguments of the select method, but I did not find a way to get around this error. Does anyone know why it’s happening and how to resolve it?

Thanks.

Hi! Could you please open an issue in our GH repo because this looks like a bug in datasets?

In the meantime, call flatten_indices() after select() and before add_column(), and keep working with the dataset it returns (dset = dset.flatten_indices()).

Will do. What you suggested works as well, thanks.