Skip rows with datasets.Dataset.map()

I am processing textual data. From each row in the dataset, I'd like to produce anywhere from zero to arbitrarily many rows in the new dataset, each holding a portion of the text.

EDIT:

  • How do I produce multiple rows in the new dataset from a single row in the old dataset?
  • How do I skip rows, i.e. produce 0 rows in the new dataset from a row in the old dataset?

Dummy example, skip every 5th row:

import datasets

dataset = datasets.load_from_disk(data_path)
print(dataset)

def mapping(row, idx):
    # Attempt to drop every 5th row by returning nothing for it
    if idx % 5 == 0:
        return None # or {} # do not add a row to new mapped dataset
    return row

new_dataset = dataset.map(mapping, with_indices=True)
print(new_dataset)

Gives:

Dataset({
    features: [...columns...],
    num_rows: 1924
})

Dataset({
    features: [...columns...],
    num_rows: 1924
})

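You can use a batched map (batched=True) to return as many rows as you want, including zero, given a batch of rows: the function receives each column as a list and may return lists of a different length. See docs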
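As a rough sketch of what that could look like for the question above: the function below drops every 5th row and splits each remaining row into smaller pieces. The column name "text", the 3-word chunk size, and the names skip_and_split and demo are assumptions made for illustration, not something taken from the original dataset.

```python
def skip_and_split(batch, indices):
    """Batched map function: drop every 5th row, split the rest into chunks.

    `batch` maps column names to lists of values. The returned lists may be
    shorter or longer than the input lists.
    """
    out = {"text": []}  # "text" is an assumed column name
    for text, idx in zip(batch["text"], indices):
        if idx % 5 == 0:
            continue  # this input row contributes 0 output rows
        words = text.split()
        # One output row per chunk of up to 3 words (arbitrary chunking rule)
        for start in range(0, len(words), 3):
            out["text"].append(" ".join(words[start:start + 3]))
    return out


def demo():
    # Imported here so skip_and_split can be tried on plain dicts as well.
    from datasets import Dataset

    dataset = Dataset.from_dict({"text": ["a b c d", "e f", "g h i j k", "l"]})
    return dataset.map(
        skip_and_split,
        batched=True,
        with_indices=True,
        # Drop the original columns, whose length no longer matches the output
        remove_columns=dataset.column_names,
    )
```

With batched=True and with_indices=True, the function is called with a list of indices for the batch rather than a single index. If you only need to drop rows and never split them, Dataset.filter is probably the simpler tool.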

Yes, using a batched map as mentioned earlier :)
