How to create a new dataset from another dataset and select specific columns and the data along with the column?

barlen · February 24, 2022, 11:46pm

How do you …

from datasets import load_dataset

dataset = load_dataset("some_dataset")

# Is there something like this that picks specific columns along with their data?
new_dataset = dataset.cherry_pick(["this_column", "that_column"])

beneyal · February 25, 2022, 12:11am

Hello

You can use the .remove_columns method on your dataset to drop the columns that you don’t want; it returns a new dataset containing only the remaining columns.

barlen · February 25, 2022, 12:43am

Thanks @beneyal

Let me ask this question another way.

Say the dataset has 35 columns. I only need a dataset with two of the columns. What’s the most efficient way to just select the two columns out of the 35?

beneyal · February 25, 2022, 12:57am

I don’t know the internals of the library, and the docs don’t mention a “cherry pick”-ish method, so the best way I see is using this:

>>> dataset
Dataset({
    features: ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'],
    num_rows: 3
})

>>> cols_to_remove = dataset.column_names
>>> cols_to_remove.remove("B")
>>> cols_to_remove.remove("Z")
>>> dataset.remove_columns(cols_to_remove)
Dataset({
    features: ['B', 'Z'],
    num_rows: 3
})

shababo · August 30, 2022, 6:37pm

For what it’s worth, I have found that operations that reference the dataset itself, such as dataset.remove_columns(cols_to_remove) with cols_to_remove = dataset.column_names, break the ability to cache downstream map operations. It is better to define a list of all features ahead of time (if you can know it) and then take the difference between that list and the columns you want to keep.
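
A minimal sketch of that suggestion, assuming the full column list is known up front. The names ALL_COLUMNS and COLUMNS_TO_KEEP are illustrative, and the A to Z columns are borrowed from the example earlier in the thread:

```python
import string

# Assumption: the complete column list is known ahead of time
# (here the 26 columns 'A'..'Z' from the example above).
ALL_COLUMNS = list(string.ascii_uppercase)
COLUMNS_TO_KEEP = ["B", "Z"]

# Build the removal list without reading anything from the dataset object.
cols_to_remove = [name for name in ALL_COLUMNS if name not in COLUMNS_TO_KEEP]

# Then, with `dataset` loaded as in the earlier posts:
# new_dataset = dataset.remove_columns(cols_to_remove)
```

Whether this actually restores caching for downstream map calls is the observation reported above, not something this sketch verifies.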

That said, I too would love a dataset.select_columns functionality as an addition to the API.
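
Later releases of the datasets library do include a Dataset.select_columns method (from around version 2.9.0, as far as I can tell, so check your installed version). A rough sketch follows. pick_columns is a hypothetical helper name, not part of the library, and the fallback branch assumes a single Dataset rather than a DatasetDict:

```python
def pick_columns(dataset, keep):
    """Return a new dataset that holds only the columns named in `keep`."""
    keep = list(keep)
    if hasattr(dataset, "select_columns"):
        # Newer `datasets` releases expose this method directly.
        return dataset.select_columns(keep)
    # Older releases: drop every column that is not in `keep`.
    # Assumes `dataset.column_names` is a flat list (a single Dataset).
    drop = [name for name in dataset.column_names if name not in keep]
    return dataset.remove_columns(drop)

# Usage, assuming `dataset` is a datasets.Dataset:
# new_dataset = pick_columns(dataset, ["B", "Z"])
```

Note that the fallback reads dataset.column_names, so the caching caveat mentioned above may still apply on older versions.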
