Filtering Dataset

baumstan · September 22, 2021, 5:58pm

I’m trying to filter a dataset so that it keeps only the rows whose ids appear in a list. The approach below is too slow. The dataset is an Arrow dataset.

from datasets import load_dataset

responses = load_dataset('peixian/rtGender', 'responses', split='train')
# post_id_test_list contains the list of ids to keep
responses_test = responses.filter(lambda x: x['post_id'] in post_id_test_list)

rgwatwormhill · September 24, 2021, 1:43pm

Hi baumstan.

I’m not sure I understand the question. Why does it matter if it is slow?

I would expect you to create and then save your train/test datasets only once, before you start using your model. If it takes a long time, just leave it running.

Are you trying to use a dynamic post_id_test_list, or to train with transient data, or what?

I suspect you might find better answers on Stack Overflow, as this doesn’t look like a Huggingface-specific question.

baumstan · September 26, 2021, 6:16pm

I have already tried Stack Overflow: "pyarrow Table Filtering -- huggingface".

louislu9911 · April 8, 2024, 8:36am

I think setting input_columns is what you want. See my answer on Stack Overflow.
