Remove a row/specific index from the dataset

zilong · December 16, 2021, 12:57am

Given the code

from datasets import load_dataset

dataset = load_dataset("glue", "mrpc", split='train')
idx = 0

How can I remove row 0 (dataset[0]) from this dataset?

The only way I can think of for now is using dataset.select(), and then selecting every index except 0, but that doesn’t seem efficient.

mariosasko · December 17, 2021, 11:25am

Hi!

You can do dataset = load_dataset("glue", "mrpc", split='train[1:]') to skip the first example while loading the dataset.

The only way I can think of for now is using dataset.select(), and then selecting every index except 0, but that doesn’t seem efficient.

Why do you think select is not efficient? It depends on the ops you use afterward, but select alone is very efficient as it only creates an indices mapping, which is (almost) equal to list(indices), and not a new PyArrow table.
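To make the "indices mapping" point concrete, here is a toy sketch in plain Python. It is not how datasets is implemented; IndexedView and its attributes are hypothetical names made up for illustration. It only shows the idea that selecting rows can mean building a list of positions on top of an unchanged table rather than copying the data.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple

Row = Dict[str, Any]


class IndexedView:
    """Hypothetical toy class: an immutable table plus an indices mapping."""

    def __init__(self, table: Tuple[Row, ...], indices: Optional[List[int]] = None) -> None:
        self._table = table  # shared between views, never modified
        self._indices = list(range(len(table))) if indices is None else indices

    def select(self, indices: Iterable[int]) -> "IndexedView":
        # Only a new list of positions is built; the table itself is reused.
        return IndexedView(self._table, [self._indices[i] for i in indices])

    def __len__(self) -> int:
        return len(self._indices)

    def __getitem__(self, i: int) -> Row:
        return self._table[self._indices[i]]


table = tuple({"id": i} for i in range(5))
view = IndexedView(table)
without_first = view.select(range(1, len(view)))  # drop row 0

print(len(without_first), without_first[0])  # 4 {'id': 1}
print(without_first._table is table)         # True: no copy of the data
```

In this sketch the cost of select is proportional to the number of kept indices, not to the size of the rows.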

zilong · December 17, 2021, 2:44pm

Hi, thank you for your reply.

The issue is that I need to remove arbitrary rows from the dataset, so not just idx = 0 but something like idxs = [76, 3, 384, 10]. Currently I do this by selecting every index that is not in idxs, which works, but I feel like there should be a better way to do it.

mariosasko · December 21, 2021, 1:08pm

Which works, but I feel like there should be a better way to do it.

“Better way” in terms of the API design? If yes, do you have an API in mind? Or better in terms of speed?
Removing rows is not easy to implement (efficiently) because PyArrow tables, which datasets uses to store data, are immutable. You could use pandas for that (ds.to_pandas()) if your dataset is not too big and can fit in memory.

mkdeeperinsights · September 21, 2022, 4:47pm

In summary, it seems the current solution is to select all of the ids except the ones you don’t want.

So in this example, something like:

from datasets import load_dataset

# load dataset
dataset = load_dataset("glue", "mrpc", split='train')

# what we don't want
exclude_idx = [76, 3, 384, 10]

# create a new dataset excluding those indices
exclude = set(exclude_idx)  # build the set once so each membership test is O(1)
dataset = dataset.select(
    [i for i in range(len(dataset)) if i not in exclude]
)

ykeselman · December 4, 2024, 3:04am

For my reasonably sized dataset of 15K rows, converting it to a pandas.DataFrame first, doing filtering there, then converting back, worked the fastest. Use dataset.to_pandas and Dataset.from_pandas.

ChIrish06 · February 8, 2025, 6:05am

This is the best solution I found on this page. Converting to pandas seemed to take a long time; I used a mask instead of an exclusion set and it ran through 30K rows in less than a second.
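The mask code itself was not posted, so the following is only a guess at what such a variant might look like, written in plain Python. The helper name indices_to_keep is hypothetical; the idea is to flag every row as "keep", clear the flags for the unwanted positions, and pass the surviving positions to select.

```python
from typing import Iterable, List


def indices_to_keep(num_rows: int, exclude_idx: Iterable[int]) -> List[int]:
    """Return the row positions that remain after dropping exclude_idx."""
    mask = [True] * num_rows  # one flag per row, True means keep
    for i in exclude_idx:
        mask[i] = False  # IndexError if i >= num_rows; negative i counts from the end
    return [i for i, keep in enumerate(mask) if keep]


keep = indices_to_keep(10, [7, 3, 0])
print(keep)  # [1, 2, 4, 5, 6, 8, 9]
```

With the dataset from this thread, the call would presumably be dataset = dataset.select(indices_to_keep(len(dataset), exclude_idx)).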
