Hi All,
I’m trying to filter a dataset that has about 1 million rows and about a dozen columns (some of those are Array2D, so the Arrow files total ~100 GB, but I don’t think that’s relevant here).
The column I’m filtering on consists of strings 100–300 characters long, and I have a set of strings that I want to exclude.
Each of the following takes about 1 min 30 s to execute:
a = dataset.filter(lambda x: x not in exclude, input_columns=['column'])
a = dataset.filter(lambda x: x not in exclude, input_columns=['column'], batch_size=None)
a = dataset.filter(lambda xs: [x not in exclude for x in xs], input_columns=['column'], batched=True)
a = dataset.filter(lambda xs: [x not in exclude for x in xs], input_columns=['column'], batch_size=None, batched=True)
But this only takes 1.85s:
a = dataset['column']
b = [i for i, x in enumerate(a) if x not in exclude]
c = dataset.select(b)
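For reference, the fast path can also be written more compactly with NumPy (just a sketch, assuming `exclude` is small enough to materialize as a list and the column fits in memory, as it does above):

import numpy as np

# Hypothetical compact variant of the select()-based approach above:
# build a boolean keep-mask over the whole column at once, then select.
column = np.asarray(dataset['column'])      # materialize the column
mask = ~np.isin(column, list(exclude))      # True where the row should be kept
a = dataset.select(np.flatnonzero(mask))    # same result as the loop version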
Is this expected, or am I doing something wrong with the dataset?
(The select()-based approach solves the problem for me; I’m mostly curious why there’s such a large difference.)