I think I’m seeing unexpected behavior when calling filter twice in succession. I wonder if it’s related to caching?
Here’s an easy reproduction:
from datasets import load_dataset
def debug_log(key, d):
    print(key, len(d), set(d['language']))
d = load_dataset('tydiqa', 'primary_task', split='validation')
debug_log('d', d)
d_filter_0 = d.filter(lambda x: x['language'] == 'english')
debug_log('d_filter_0', d_filter_0)
d_filter_1 = d_filter_0.filter(lambda x: x['language'] == 'english')
debug_log('d_filter_1', d_filter_1)
d_filter_2 = d.filter(lambda x: x['language'] == 'english')
debug_log('d_filter_2', d_filter_2)
Output (notice that d_filter_1 has the same number of rows as d_filter_0 but reports all eleven languages instead of only 'english'):
Reusing dataset tydiqa (/Users/adrozdov/.cache/huggingface/datasets/tydiqa/primary_task/1.0.0/b8a6c4c0db10bf5703d7b36645e5dbae821b8c0e902dac9daeecd459a8337148)
d 18670 {'bengali', 'swahili', 'english', 'thai', 'telugu', 'indonesian', 'russian', 'arabic', 'finnish', 'japanese', 'korean'}
Loading cached processed dataset at /Users/adrozdov/.cache/huggingface/datasets/tydiqa/primary_task/1.0.0/b8a6c4c0db10bf5703d7b36645e5dbae821b8c0e902dac9daeecd459a8337148/cache-1c4d9d00950a4a3d.arrow
d_filter_0 1031 {'english'}
Loading cached processed dataset at /Users/adrozdov/.cache/huggingface/datasets/tydiqa/primary_task/1.0.0/b8a6c4c0db10bf5703d7b36645e5dbae821b8c0e902dac9daeecd459a8337148/cache-6c4a6a69baa0ff73.arrow
d_filter_1 1031 {'bengali', 'swahili', 'english', 'thai', 'telugu', 'indonesian', 'russian', 'arabic', 'finnish', 'japanese', 'korean'}
Loading cached processed dataset at /Users/adrozdov/.cache/huggingface/datasets/tydiqa/primary_task/1.0.0/b8a6c4c0db10bf5703d7b36645e5dbae821b8c0e902dac9daeecd459a8337148/cache-1c4d9d00950a4a3d.arrow
d_filter_2 1031 {'english'}
It would be great to be able to call filter twice! Basically, I'd like to filter the first time by language, and the second time by something else, like length.
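In the meantime, since d_filter_2 (a single filter applied directly to d) comes out right, one possible workaround is to fold both conditions into one predicate and call filter only once. This is just a sketch, and it assumes the problem only affects a filter applied to an already-filtered dataset. The helper names (is_english, is_short, all_of), the max_chars threshold, and the use of a 'question_text' column for length are all illustrative assumptions, not anything from the library.

```python
# Hypothetical helpers; names and the length threshold are illustrative only.
def is_english(example):
    return example['language'] == 'english'

def is_short(example, max_chars=50):
    # Assumes a 'question_text' string column; substitute whichever column you measure.
    return len(example['question_text']) <= max_chars

def all_of(*predicates):
    """Return one predicate that is true only when every given predicate is true."""
    def combined(example):
        return all(p(example) for p in predicates)
    return combined

keep = all_of(is_english, is_short)

# Toy rows standing in for dataset examples, so the predicate can be
# checked without downloading anything.
rows = [
    {'language': 'english', 'question_text': 'Who wrote Hamlet?'},
    {'language': 'english', 'question_text': 'x' * 80},
    {'language': 'thai', 'question_text': 'short'},
]
kept = [r for r in rows if keep(r)]

# With the dataset from the reproduction, the equivalent would be a single call:
# d_filtered = d.filter(keep)
```

This sidesteps the chained call rather than explaining it, so it doesn't tell us whether caching is actually the cause.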