Hi All,
I’m trying to filter a dataset that has about 1 million rows and about a dozen columns (some of those are Array2D, so the Arrow files total ~100 GB, but I don’t think that’s relevant here).
The column I’m filtering on consists of strings 100–300 characters long, and I have a set of strings that I want to exclude.
Each of the following takes about 1 min 30 s to execute:
a = dataset.filter(lambda x: x not in exclude, input_columns=['column'])
a = dataset.filter(lambda x: x not in exclude, input_columns=['column'], batch_size=None)
a = dataset.filter(lambda xs: [x not in exclude for x in xs], input_columns=['column'], batched=True)
a = dataset.filter(lambda xs: [x not in exclude for x in xs], input_columns=['column'], batch_size=None, batched=True)
But this only takes 1.85s:
a = dataset['column']
b = [i for i, x in enumerate(a) if x not in exclude]
c = dataset.select(b)
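For reference, the fast path can also be written more compactly with NumPy (just a sketch, assuming `exclude` is small enough to materialize as a list and the column fits in memory, as it does above):

import numpy as np

# Hypothetical compact variant of the select()-based approach above:
# build a boolean keep-mask over the whole column at once, then select.
column = np.asarray(dataset['column'])      # materialize the column
mask = ~np.isin(column, list(exclude))      # True where the row should be kept
a = dataset.select(np.flatnonzero(mask))    # same result as the loop version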
Is this expected, or am I doing something wrong with the dataset?
(The select()-based approach solves the problem for me; I’m mostly curious why there’s such a large difference.)