How to create a new dataset from another dataset, keeping only specific columns and their data?

How do you …

from datasets import load_dataset

dataset = load_dataset("some_dataset")

# is there something like this that keeps only specific columns and their data ...
new_dataset = dataset.cherry_pick(["this_column", "that_column"])

Hello :wave:

You can use the .remove_columns method on your dataset to drop the columns that you don’t want. It returns a new dataset containing only the remaining columns.

Thanks @beneyal

Let me ask this question another way

Say the dataset has 35 columns. I only need a dataset with two of the columns. What’s the most efficient way to just select the two columns out of the 35?

I don’t know the internals of the library, and the docs don’t mention a “cherry pick”-ish method, so the best way I see is using this:

>>> dataset
Dataset({
    features: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'],
    num_rows: 3
})

>>> cols_to_remove = dataset.column_names
>>> cols_to_remove.remove("B")
>>> cols_to_remove.remove("Z")
>>> dataset.remove_columns(cols_to_remove)
Dataset({
    features: ['B', 'Z'],
    num_rows: 3
})

For what it’s worth, I have found that building the argument from the dataset itself, as in dataset.remove_columns(cols_to_remove) with cols_to_remove = dataset.column_names, breaks the ability to cache downstream map operations. It is better to define a list of all features ahead of time (if you know them) and then take the difference between that list and the columns you want to keep.
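A minimal sketch of that diff approach, in case it helps. The column names and the columns_to_drop helper are made up for illustration (they are not part of the datasets library), and whether this actually keeps caching intact is only my observation above, not something the sketch verifies.

```python
# Hypothetical column names, written out ahead of time instead of being
# read from dataset.column_names.
ALL_COLUMNS = ["A", "B", "C", "D", "Z"]
KEEP = ["B", "Z"]


def columns_to_drop(all_columns, keep):
    """Return the names in all_columns that are not in keep, in original order."""
    keep_set = set(keep)
    # Fail early on a typo rather than silently keeping nothing.
    missing = keep_set - set(all_columns)
    if missing:
        raise ValueError(f"Unknown columns: {sorted(missing)}")
    return [name for name in all_columns if name not in keep_set]


cols_to_remove = columns_to_drop(ALL_COLUMNS, KEEP)
print(cols_to_remove)  # ['A', 'C', 'D']

# With a loaded dataset, this list would then be passed to remove_columns:
# new_dataset = dataset.remove_columns(cols_to_remove)
```

A list comprehension is used instead of a plain set difference so the result keeps the original column order and comes out the same on every run.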

That said, I too would love a dataset.select_columns functionality as an addition to the API.
