I am processing textual data. From each row of the dataset, I’d like to produce anywhere from zero to arbitrarily many rows in the new dataset, each holding a portion of the original text.
EDIT:
- Is there a way to turn a single row into multiple rows, i.e. how do I produce several rows in the new dataset from one row in the old dataset?
- Is there a way to skip rows, i.e. how do I produce 0 rows in the new dataset from a row in the old dataset?
Dummy example, skip every 5th row:
import datasets

dataset = datasets.load_from_disk(data_path)
print(dataset)

def mapping(row, idx):
    if idx % 5 == 0:
        return None  # or {}; intended to add no row to the new mapped dataset
    return row

new_dataset = dataset.map(mapping, with_indices=True)
print(new_dataset)
Gives:
Dataset({
    features: [...columns...],
    num_rows: 1924
})
Dataset({
    features: [...columns...],
    num_rows: 1924
})
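One approach that may cover both cases is a batched map: with batched=True the function receives a dict of lists and can return lists of a different length, so a single input row can yield zero or several output rows. Below is a minimal sketch, not a definitive implementation. It assumes the text lives in a column named "text"; that column name, the "source_idx" column, and the helpers split_text, expand_rows and remap are all hypothetical names chosen for illustration.

```python
def split_text(text, size=20):
    """Hypothetical splitter: cut text into chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def expand_rows(batch, indices):
    """Batched map function: emit zero or more output rows per input row.

    `batch` is a dict of lists and `indices` holds the row indices,
    which is what map passes when batched=True and with_indices=True.
    """
    out = {"text": [], "source_idx": []}
    for text, idx in zip(batch["text"], indices):  # "text" column is an assumption
        if idx % 5 == 0:
            continue  # emit nothing for this row, i.e. skip it
        for chunk in split_text(text):
            out["text"].append(chunk)    # one output row per chunk
            out["source_idx"].append(idx)  # remember which input row it came from
    return out


def remap(dataset):
    """Apply expand_rows to a datasets.Dataset and return the new dataset."""
    return dataset.map(
        expand_rows,
        batched=True,
        with_indices=True,
        # Drop the old columns: their length no longer matches the new rows.
        remove_columns=dataset.column_names,
    )
```

With this, new_dataset = remap(dataset) should report a different num_rows than the input. If the only goal is to drop rows, dataset.filter(lambda row, idx: idx % 5 != 0, with_indices=True) is likely the simpler option.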