Skip rows with datasets.Dataset.map()

roccofortuna · December 28, 2022, 7:30pm

I am processing textual data. From each row in the dataset, I’d like to produce anywhere from zero to an arbitrary number of rows in the new dataset, each holding a portion of the text.

EDIT:

Is there a way to turn a single row into multiple rows, i.e. how do I create multiple rows in the new dataset from one row in the old dataset?
Is there a way to skip rows, i.e. how do I create 0 rows in the new dataset from a row in the old dataset?

Dummy example, skip every 5th row:

import datasets

dataset = datasets.load_from_disk(data_path)
print(dataset)

def mapping(row, idx):
    if idx % 5 == 0:
        return None  # or {}; intended: do not add a row to the new mapped dataset
    return row

new_dataset = dataset.map(mapping, with_indices=True)
print(new_dataset)

Gives:

Dataset({
    features: [...columns...],
    num_rows: 1924
})

Dataset({
    features: [...columns...],
    num_rows: 1924
})

lhoestq · January 3, 2023, 10:57am

You can use a batched map to return as many rows as you want given a batch of rows. See docs

Yes, also with a batched map, as mentioned above: the function may return fewer rows than the batch it received.
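
For readers who want something concrete, here is a minimal sketch of the batched approach, not necessarily the exact pattern from the docs. It assumes the dataset has a string column named `text`; that column name, the output columns `chunk` and `source_idx`, the `chunk_size` parameter, and the helper names `chunk_rows` and `expand_dataset` are all made up for illustration. Every 5th row is skipped (as in the dummy example above) and every other row is split into fixed-size pieces, so one input row can yield 0, 1, or many output rows.

```python
def chunk_rows(batch, indices, chunk_size=20):
    """Batched map function: each input row may produce 0, 1, or many output rows.

    `batch` is a dict of column name -> list of values; `indices` is the list of
    row indices passed in when with_indices=True is combined with batched=True.
    The "text" column is an assumption about the dataset.
    """
    out = {"chunk": [], "source_idx": []}
    for text, idx in zip(batch["text"], indices):
        if idx % 5 == 0:
            continue  # skip: this input row contributes no output rows
        # Split the text into consecutive pieces of at most chunk_size characters.
        for start in range(0, len(text), chunk_size):
            out["chunk"].append(text[start:start + chunk_size])
            out["source_idx"].append(idx)
    return out


def expand_dataset(dataset):
    """Apply chunk_rows to a datasets.Dataset and return the new dataset."""
    return dataset.map(
        chunk_rows,
        batched=True,
        with_indices=True,
        # Drop the original columns: their length would no longer match
        # the number of rows returned for the batch.
        remove_columns=dataset.column_names,
    )
```

Calling `new_dataset = expand_dataset(dataset)` should then give a dataset whose `num_rows` differs from the original. If rows only need to be dropped and never multiplied, `dataset.filter(lambda row, idx: idx % 5 != 0, with_indices=True)` is a simpler alternative.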
