How to filter samples on-the-fly?

My question is similar to this issue, but this time it's about filtering.

How can I filter samples on-the-fly without going through the process of writing another cache file? I was able to preprocess samples using datasets.Dataset.set_transform():

import datasets

train_dataset = make_dataset("my_dataset")
train_dataset = datasets.Dataset(train_dataset.data)
train_dataset.set_transform(preprocess)

but how can I do something similar when I want to filter?

I think you would have to use .filter() beforehand. I don't think you can do that on-the-fly on a "map-style dataset", or you can create unintuitive situations. For example, what if the first example has to be filtered out and your data loader asks for train_dataset[0]? Should it raise an error?
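
For instance, a minimal sketch with toy in-memory data (keep_example is a hypothetical predicate; swap in your own condition and your real dataset):

import datasets

# toy data for illustration (stands in for the question's make_dataset helper)
train_dataset = datasets.Dataset.from_dict({"text": ["hello", "", "world"]})

def keep_example(example):
    # hypothetical predicate: keep only non-empty texts
    return len(example["text"]) > 0

# .filter() runs eagerly over the full dataset up front,
# so afterwards indices like train_dataset[0] are well-defined
train_dataset = train_dataset.filter(keep_example)

train_dataset.set_transform(lambda batch: batch)  # stand-in for the question's preprocess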

There is another type of dataset that can actually filter on-the-fly: "iterable datasets" (or "streaming datasets"). Currently you can load a dataset in streaming mode with load_dataset(..., streaming=True). The filter method is still to be implemented, though it should be possible to filter out examples using a batched map, e.g.

def process_and_filter(batch):
    """A function passed to map with batched=True can return fewer examples than the input batch"""
    # drop empty texts; the output batch is smaller than the input batch
    return {"text": [text for text in batch["text"] if len(text) > 0]}

from datasets import load_dataset

ds = load_dataset("path/to/dataset", streaming=True, split="train")
ds = ds.map(process_and_filter, batched=True)
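
Since streaming datasets are consumed lazily, you can sanity-check the result just by iterating; a quick sketch (assuming the dataset has a "text" column):

from itertools import islice

# take a handful of examples from the lazily evaluated stream
for example in islice(ds, 5):
    assert len(example["text"]) > 0  # empty texts were dropped by the batched map
    print(example["text"][:80])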