I’m trying to filter a dataset based on the ids in a list. This approach is too slow. The dataset is an Arrow dataset.
from datasets import load_dataset

responses = load_dataset('peixian/rtGender', 'responses', split='train')
# post_id_test_list contains a list of ids
responses_test = responses.filter(lambda x: x['post_id'] in post_id_test_list)
Hi baumstan.
I’m not sure I understand the question. Why does it matter if it is slow?
I would expect you to create and then save your train/test datasets only once, before you start using your model. If it takes a long time, just leave it running.
Are you trying to use a dynamic post_id_test_list, or to train with transient data, or what?
I suspect you might find better answers on Stack Overflow, as this doesn’t look like a Huggingface-specific question.
I think setting input_columns is what you want. See my answer on Stack Overflow.