Dataset select function: retrieving the examples not selected

kloee · December 9, 2024, 10:57am

Hi,

Is there a good way to retrieve the examples that were filtered out when using the DatasetDict.filter() function?

For now, I’m calling filter() on a DatasetDict this way:

datasets = datasets.filter(lambda example: example['label_kept'] not in labels_to_remove)

Currently I compute the list of removed examples beforehand for each split, but I was wondering if there’s a better way to do that. I need the ids of these removed examples so I can compare them with the original dev/test files at the end.
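To make the pre-computation step concrete, here is a minimal sketch of collecting the removed ids per split before filtering. It is only an illustration: the helper name `removed_ids_per_split` and the `id` column name are assumptions, and a plain dict of lists stands in for the real DatasetDict (the loop relies only on `.items()` and on each split yielding dict-like examples when iterated).

```python
def removed_ids_per_split(splits, labels_to_remove, id_column="id"):
    """Collect, for each split, the ids of examples the filter would drop.

    `splits` maps a split name to an iterable of dict-like examples.
    `id_column` is a hypothetical column name; adjust it to the real data.
    """
    removed = {}
    for split_name, examples in splits.items():
        removed[split_name] = [
            example[id_column]
            for example in examples
            # Inverse of the filter predicate: these examples get removed.
            if example["label_kept"] in labels_to_remove
        ]
    return removed


# Toy stand-in for a DatasetDict: split name -> list of example dicts.
toy_splits = {
    "dev": [
        {"id": 0, "label_kept": "A"},
        {"id": 1, "label_kept": "B"},
    ],
    "test": [
        {"id": 2, "label_kept": "B"},
        {"id": 3, "label_kept": "C"},
    ],
}
labels_to_remove = {"B"}

removed = removed_ids_per_split(toy_splits, labels_to_remove)
print(removed)  # {'dev': [1], 'test': [2]}
```

With the real data, the helper would be called on the DatasetDict before the filter() call so that the ids are captured while the examples still exist. Iterating row by row like this may be slow on large splits.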

Thanks
