Datasets behaving strangely when calling filter twice

I think I’m seeing unexpected behavior when calling filter twice in succession. I wonder if it’s related to caching?

Here’s an easy reproduction:

from datasets import load_dataset

def debug_log(key, d):
    print(key, len(d), set(d['language']))

d = load_dataset('tydiqa', 'primary_task', split='validation')
debug_log('d', d)
d_filter_0 = d.filter(lambda x: x['language'] == 'english')
debug_log('d_filter_0', d_filter_0)
d_filter_1 = d_filter_0.filter(lambda x: x['language'] == 'english')
debug_log('d_filter_1', d_filter_1)
d_filter_2 = d.filter(lambda x: x['language'] == 'english')
debug_log('d_filter_2', d_filter_2)

Output (notice that languages for d_filter_0 != d_filter_1):

Reusing dataset tydiqa (/Users/adrozdov/.cache/huggingface/datasets/tydiqa/primary_task/1.0.0/b8a6c4c0db10bf5703d7b36645e5dbae821b8c0e902dac9daeecd459a8337148)
d 18670 {'bengali', 'swahili', 'english', 'thai', 'telugu', 'indonesian', 'russian', 'arabic', 'finnish', 'japanese', 'korean'}
Loading cached processed dataset at /Users/adrozdov/.cache/huggingface/datasets/tydiqa/primary_task/1.0.0/b8a6c4c0db10bf5703d7b36645e5dbae821b8c0e902dac9daeecd459a8337148/cache-1c4d9d00950a4a3d.arrow
d_filter_0 1031 {'english'}
Loading cached processed dataset at /Users/adrozdov/.cache/huggingface/datasets/tydiqa/primary_task/1.0.0/b8a6c4c0db10bf5703d7b36645e5dbae821b8c0e902dac9daeecd459a8337148/cache-6c4a6a69baa0ff73.arrow
d_filter_1 1031 {'bengali', 'swahili', 'english', 'thai', 'telugu', 'indonesian', 'russian', 'arabic', 'finnish', 'japanese', 'korean'}
Loading cached processed dataset at /Users/adrozdov/.cache/huggingface/datasets/tydiqa/primary_task/1.0.0/b8a6c4c0db10bf5703d7b36645e5dbae821b8c0e902dac9daeecd459a8337148/cache-1c4d9d00950a4a3d.arrow
d_filter_2 1031 {'english'}
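
A quick way to catch this kind of leak is to re-apply the predicate to the filtered result and list any row that fails it. This is only a sketch on toy rows: `leaked_rows` and `is_english` are hypothetical helper names, not part of the datasets library, and the same check should work on any iterable that yields dict-like examples.

```python
def leaked_rows(rows, predicate):
    """Return the rows that fail `predicate`; an empty list means the filter held."""
    return [row for row in rows if not predicate(row)]


def is_english(row):
    return row["language"] == "english"


# Toy stand-in for a filtered result that wrongly contains a non-English row.
suspect = [
    {"language": "english"},
    {"language": "thai"},
]

print(leaked_rows(suspect, is_english))
```

On the reproduction above, something like `leaked_rows(d_filter_1, is_english)` would presumably come back non-empty on the affected setup.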

It would be great to be able to call filter twice! Basically, I'd like to filter the first time by language, and the second time by something else like length.

Hey @mrdrozdov, instead of chaining multiple filter operations, I think you can combine the conditions in a single lambda function with and/or, like:

dset.filter(lambda x: x['language'] == 'english' or x['language'] == 'german')
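
For the language-then-length case from the question, both checks can live in one predicate so that filter only runs once. A rough sketch, with some assumptions: `keep_example` is a hypothetical name, the length is measured on a `question_text` column (assumed to exist in the dataset), and the 100-character cutoff is arbitrary. The rows below are toy data so the snippet runs without downloading anything.

```python
def keep_example(example, language="english", max_chars=100):
    """Keep rows in the given language whose question fits within max_chars."""
    # Assumption: the text to measure lives in a "question_text" column.
    return (
        example["language"] == language
        and len(example["question_text"]) <= max_chars
    )


# Toy rows standing in for dataset examples.
rows = [
    {"language": "english", "question_text": "Who wrote Hamlet?"},
    {"language": "english", "question_text": "x" * 500},
    {"language": "finnish", "question_text": "Kuka kirjoitti Hamletin?"},
]

kept = [row for row in rows if keep_example(row)]
print(len(kept))
```

With a real dataset the same function can be passed straight to filter, e.g. `d.filter(keep_example)`.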

Thanks! Yeah, as long as I don’t chain filter it seems to work fine. It still seems like an issue though. Maybe it should throw an exception if filter is chained?

Hi! Yes, this is a bug in versions 1.12.0 and 1.12.1.
It has recently been fixed on master and we’ll do a new release.
See "Fix filter leaking" by lhoestq, Pull Request #3019 in huggingface/datasets on GitHub.
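
Until the new release is out, a small guard like the one below can flag an environment running one of the two versions named above. It is just a sketch: `is_affected` and `installed_datasets_version` are hypothetical helper names, and the check only compares against those two version strings; it makes no claim about any other release.

```python
from importlib.metadata import PackageNotFoundError, version

# The two releases named in this thread as having the chained-filter bug.
AFFECTED_VERSIONS = {"1.12.0", "1.12.1"}


def is_affected(installed_version):
    """True if the version string is one of the releases named above."""
    return installed_version in AFFECTED_VERSIONS


def installed_datasets_version():
    """Return the installed datasets version, or None if it is not installed."""
    try:
        return version("datasets")
    except PackageNotFoundError:
        return None


current = installed_datasets_version()
if current is not None and is_affected(current):
    print(f"datasets {current} is affected; avoid chaining filter calls.")
```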
