Filtering Dataset

I’m trying to filter a dataset based on the ids in a list. The approach below is too slow. The dataset is an Arrow dataset.

from datasets import load_dataset

responses = load_dataset('peixian/rtGender', 'responses', split='train')
# post_id_test_list contains the list of ids to keep
responses_test = responses.filter(lambda x: x['post_id'] in post_id_test_list)

Hi baumstan.

I’m not sure I understand the question. Why does it matter if it is slow?

I would expect you to create and then save your train/test datasets only once, before you start using your model. If it takes a long time, just leave it running.

Are you trying to use a dynamic post_id_test_list, or to train with transient data, or what?

I suspect you might find better answers on Stack Overflow, as this doesn’t look like a Hugging Face-specific question.

I have already tried Stack Overflow: "python - pyarrow Table Filtering -- huggingface" (Stack Overflow).

I think setting input_columns is what you want. See my answer on Stack Overflow.
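
For readers who do not follow the link, here is a minimal sketch of what setting input_columns could look like. The helper names make_batch_predicate and filter_by_ids are made up for illustration. Converting the id list to a set and passing batched=True are additional assumptions on my part, not something stated in this thread. With input_columns, the filter function is handed only the values of that column, which should avoid building a full row dictionary for every example.

```python
def make_batch_predicate(wanted_ids):
    """Build a batched predicate for Dataset.filter (hypothetical helper)."""
    # Assumption: ids are hashable. A set gives average O(1) membership
    # checks, whereas `in` on a list scans the whole list each time.
    wanted = set(wanted_ids)

    def keep_batch(post_ids):
        # With input_columns set and batched=True, the function receives
        # the column values for one batch as a list and must return one
        # boolean per value.
        return [post_id in wanted for post_id in post_ids]

    return keep_batch


def filter_by_ids(dataset, wanted_ids, column="post_id"):
    """Keep only rows whose `column` value is in wanted_ids (hypothetical helper)."""
    return dataset.filter(
        make_batch_predicate(wanted_ids),
        input_columns=column,  # pass only this column to the predicate
        batched=True,          # call the predicate once per batch of rows
    )
```

With the dataset from the question, this would be called as responses_test = filter_by_ids(responses, post_id_test_list). Whether it is fast enough will depend on the dataset size and the datasets version, so it is worth timing on a small slice first.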