Filtering performance

lhoestq · March 5, 2025, 2:33pm

Yes, that's correct: filter() only stores the indices of the rows that pass the filter, which saves disk space.

For people who want to rewrite the dataset completely (e.g. to end up with contiguous data and get faster reads), there is ds.flatten_indices(), which rewrites the dataset and removes the indices mapping.
