My question is similar to this issue but this time it’s about filtering.
How can I filter samples on the fly, without writing another cache file? I was able to preprocess samples using datasets.Dataset.set_transform().
I think you would have to use .filter() beforehand. I don’t think you can do that on the fly on a “map-style dataset”, or you could create unintuitive situations. For example, what if the first example has to be filtered out and your data loader asks for train_dataset[0]? Should it raise an error?
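To make the indexing problem concrete, here is a small pure-Python sketch. It does not use the datasets library: the `examples` list, `keep` predicate, and `get_filtered` helper are hypothetical stand-ins for a map-style dataset and its filter.

```python
# Hypothetical stand-in for a map-style dataset: a plain list of examples.
examples = [{"text": ""}, {"text": "hello"}, {"text": ""}, {"text": "world"}]


def keep(example):
    """Filter predicate: keep examples with non-empty text."""
    return len(example["text"]) > 0


# Filtering lazily at access time leaves index 0 ambiguous: examples[0]
# fails the predicate, so there is nothing sensible to return for it.
assert not keep(examples[0])

# Filtering up front yields a fixed list of surviving indices, so every
# position in the filtered view maps to exactly one original example.
kept_indices = [i for i, ex in enumerate(examples) if keep(ex)]


def get_filtered(index):
    """Return the index-th example of the filtered view."""
    return examples[kept_indices[index]]


first = get_filtered(0)
filtered_length = len(kept_indices)
```

Once the surviving indices are known ahead of time, both the length and every index of the filtered view are well defined, which is what a data loader over a map-style dataset relies on.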
There is another type of dataset that can actually filter on the fly: “iterable datasets” (or “streaming datasets”). Currently you can load a dataset in streaming mode with load_dataset(..., streaming=True). The filter method is still to be implemented, though it should be possible to filter out examples using a batched map, e.g.
from datasets import load_dataset

def process_and_filter(batch):
    """This function passed to map can return fewer examples than the input batch."""
    # Return a list per column; if the dataset has other columns, they need the same filtering
    return {"text": [text for text in batch["text"] if len(text) > 0]}

ds = load_dataset("path/to/dataset", streaming=True, split="train")
ds = ds.map(process_and_filter, batched=True)
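If the dataset has more columns than "text", every column in the batch probably needs to be filtered with the same mask so the column lengths stay aligned. A minimal sketch of such a batch function, checked here on a plain dict rather than a real dataset (`filter_batch` and the "label" column are hypothetical names):

```python
def filter_batch(batch, column="text"):
    """Drop rows whose `column` value is empty, across every column of the batch."""
    # One boolean per row: True if the row should be kept.
    keep = [len(value) > 0 for value in batch[column]]
    # Apply the same mask to each column so all columns keep equal length.
    return {
        name: [value for value, kept in zip(values, keep) if kept]
        for name, values in batch.items()
    }


# A batch in the dict-of-lists layout that a batched map function receives.
batch = {"text": ["a", "", "c"], "label": [0, 1, 2]}
result = filter_batch(batch)
```

This should be usable as the function passed to a batched map in place of process_and_filter, though I have not verified it against a specific datasets version.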