Filter Large Dataset Entry by Entry

I am not sure if this is a new feature, but I wanted to post this problem here, and hear if others have ways of optimizing and speeding up this process.

Let’s say I have a really large dataset that I cannot load into memory. At this point, I am only aware of streaming=True to load the dataset. Now, the dataset consists of many tables. Ideally, I would want to have some simple filtering criterion, such that I only see the “good” tables. Here is an example of what the code might look like:

from itertools import islice
from datasets import load_dataset

dataset = load_dataset(
    "really-large-dataset",
    split="train",  # a split is needed so islice iterates examples, not split names
    streaming=True,
)
# And let's say we process the dataset bit by bit because we want intermediate results
dataset = islice(dataset, 10000)

# Define a predicate that decides whether to keep a table
def filter_function(table):
    # some_condition stands in for the actual filtering criterion
    return some_condition(table)

# Use the filter function on your dataset (lazily, entry by entry)
filtered_dataset = (ex for ex in dataset if filter_function(ex))

And then I work on the filtered dataset, which would be orders of magnitude faster than working on the original. I would love to hear if the problem setup + solution makes sense to people, and if anyone has suggestions!

Hi @QiyaoWei ,
Don’t know about streaming datasets, but filter() looks like what you’re looking for?!

Best,
M

Hey @mikehemberger, I did look at filter(), but from what I understand, filter() applies to a single large table and filters it row by row, not to a set of tables filtered table by table. Is there some functionality here that I am missing?
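To make the distinction concrete, here is a toy sketch (plain Python lists standing in for tables, and `keep_table` is a hypothetical predicate, not a library API) contrasting row-by-row filtering inside one table with table-by-table filtering over a stream of tables:

```python
# A "stream" of three small tables, each table a list of rows (dicts)
stream = [
    [{"x": 1}, {"x": 2}],   # table 0
    [{"x": -1}],            # table 1
    [{"x": 3}, {"x": 4}],   # table 2
]

# Row-by-row: drop individual rows inside a single table
table = stream[0]
rows_kept = [row for row in table if row["x"] > 1]  # -> [{"x": 2}]

# Table-by-table: a predicate accepts or rejects whole tables
def keep_table(table):
    # hypothetical criterion: keep tables whose values are all positive
    return all(row["x"] > 0 for row in table)

tables_kept = [t for t in stream if keep_table(t)]  # keeps tables 0 and 2
```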

Hey again, seems like I misinterpreted your question / code example. Could you elaborate on what the entries of your sliced dataset consist of?
Best,
M

Hey @mikehemberger, sure thing! With streaming=True, the dataset object becomes an IterableDataset that is consumed via next(iter(dataset)). In my case, each entry returned by next(iter(dataset)) is a pd.DataFrame, so I was not sure whether the filter() function could be applied to an iterator of DataFrames, if that makes sense.

From what I gather a single entry of your dataset is a dataframe, correct?
Would it be feasible then to check each entry of your data against your condition and then filter on the resulting Boolean array?
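The Boolean-array idea can be done lazily with itertools.compress, so the mask and the stream are consumed in step and nothing is held in memory (toy integer entries below stand in for DataFrames; the positivity condition is a hypothetical placeholder):

```python
from itertools import compress, tee

# Toy stream of entries; in the real case each entry would be a DataFrame
entries = iter([10, -3, 7, 0, 5])

# Duplicate the iterator: one copy builds the Boolean mask, the other is filtered
to_check, to_keep = tee(entries)
mask = (x > 0 for x in to_check)          # hypothetical condition

filtered = list(compress(to_keep, mask))  # -> [10, 7, 5]
```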

Hey @mikehemberger, yes and yes. That is exactly what I did in the last line of the code I provided, with the generator expression using filter_function. I was just wondering whether I did it in the best way, but it looks like that is the best solution.
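For what it's worth, the generator expression in the original snippet is equivalent to the built-in filter(), which is also lazy, so either form works on a stream without loading it into memory (the modulo predicate below is just a stand-in for the real criterion):

```python
data = iter(range(10))

def filter_function(x):
    # stand-in predicate; the real one would inspect a table
    return x % 2 == 0

filtered = filter(filter_function, data)           # lazy, like the genexp
first_three = [next(filtered) for _ in range(3)]   # -> [0, 2, 4]
```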

Hey @QiyaoWei ,
Maybe not the best but if it works in a reasonable amount of time still great! :smile:
Best,
M