I am not sure whether this calls for a new feature, but I wanted to post the problem here and hear whether others have ways of speeding up this process.
Let’s say I have a really large dataset that I cannot load into memory. As far as I know, streaming=True is the only way to load it. The dataset consists of many tables. Ideally, I would apply a simple filtering criterion so that I only see the “good” tables. Here is an example of what the code might look like:
from itertools import islice

from datasets import load_dataset

dataset = load_dataset(
    "really-large-dataset",
    split="train",  # assumed split name; without split=, a dict of splits is returned
    streaming=True,
)
# Process the dataset bit by bit because we want intermediate results
dataset = islice(dataset, 10000)

# Define a function to filter the data
def filter_function(table):
    # some_condition is a placeholder for the actual test on the table
    if some_condition:
        return True
    else:
        return False

# Use the filter function on your dataset
filtered_dataset = (ex for ex in dataset if filter_function(ex))
And then I work on the processed dataset, which would be orders of magnitude faster than working on the original. I would love to hear whether the problem setup and solution make sense to people, and whether anyone has suggestions!
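One detail of the "bit by bit" idea is worth sketching: calling islice repeatedly on the same shared iterator lets each chunk resume where the previous one stopped, so intermediate results can be collected per chunk. The sketch below uses only the standard library; make_stream, iter_chunks, and the row-count criterion are hypothetical stand-ins for the streamed dataset and the real filter, not part of the datasets API.

```python
from itertools import islice


def make_stream(n_tables):
    """Hypothetical stand-in for a streamed dataset: yields one small table at a time."""
    for i in range(n_tables):
        yield {"id": i, "rows": list(range(i % 4))}


def filter_function(table):
    # Hypothetical criterion: a table is "good" if it has at least two rows.
    return len(table["rows"]) >= 2


def iter_chunks(stream, chunk_size):
    """Yield lists of up to chunk_size items without restarting the stream."""
    it = iter(stream)  # one shared iterator, so each chunk resumes where the last ended
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


results = []
for chunk in iter_chunks(make_stream(10), chunk_size=4):
    # Filter table by table within the chunk, keeping only the ids for inspection.
    good_ids = [table["id"] for table in chunk if filter_function(table)]
    results.append(good_ids)

print(results)
```

Only one chunk is held in memory at a time, and a chunk may come back empty if none of its tables pass the filter.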
Hi @QiyaoWei ,
I don’t know about streaming datasets, but filter() looks like what you’re looking for?
Best,
M
Hey @mikehemberger, I did look at filter(), but from what I understand, filter() applies to a single large table and filters it row by row; it does not take a set of tables and filter them table by table. Is there functionality here that I am missing?
Hey again, it seems I misinterpreted your question and code example. Could you elaborate on what the entries of your sliced dataset consist of?
Best,
M
Hey @mikehemberger, sure thing! When streaming=True, the dataset object becomes an IterableDataset that can be consumed with next(iter(dataset)). In my case, next(iter(dataset)) returns a pd.DataFrame, so I was not sure whether filter() could be applied to an iterator of DataFrames, if that makes sense.
From what I gather, a single entry of your dataset is a DataFrame, correct?
Would it then be feasible to check each entry against your condition and filter on the resulting Boolean array?
Hey @mikehemberger, yes and yes. That is exactly what I did in the last line of the code I provided, using the iterator with filter_function. I was just wondering whether I did it in the best way, but it looks like that is the best solution.
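For reference, Python's built-in filter() behaves like the generator expression: both are lazy and pull tables from the stream only as results are requested. A minimal sketch under that reading, where stream, the pulled list, and the n_rows criterion are hypothetical and exist only to make the laziness visible:

```python
from itertools import islice

pulled = []  # records which tables were actually read from the stream


def stream():
    """Hypothetical stand-in for an iterator of tables."""
    for i in range(100):
        pulled.append(i)
        yield {"id": i, "n_rows": i % 5}


def filter_function(table):
    # Hypothetical criterion: keep tables that report zero rows.
    return table["n_rows"] == 0


# Built-in filter() is lazy, just like (ex for ex in dataset if filter_function(ex)).
lazy = filter(filter_function, stream())

# Take only the first three matching tables; the rest of the stream is never read.
first_three = [table["id"] for table in islice(lazy, 3)]

print(first_three, len(pulled))
```

Because nothing is read beyond the last table needed, stopping early costs nothing extra; the total cost is dominated by reading the tables and evaluating the condition on each one.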
Hey @QiyaoWei ,
Maybe not the best, but if it works in a reasonable amount of time, that’s still great!
Best,
M