How to use `map` or similar when one row is mapped to multiple rows?

mralexis · July 15, 2021, 6:46am

Suppose I have a dataset with 100 rows and a function that turns each row into 10 rows. What I want is a mapped dataset with 1000 rows. Is there a way to do this using the package? Currently I get a length mismatch error when using `map`:

ArrowInvalid: Column 1 named test_col expected length 100 but got length 1000

lhoestq · July 20, 2021, 9:35am

Hi! You can use `map` in batched mode (batched=True) to return more rows than the input:

from datasets import Dataset

dataset = Dataset.from_dict({"foo": ["aaa", "bbb", "ccc"]})  # 3 rows

def augment(batch):
    # batch["foo"] is a list of strings; emit one output row per character
    return {"bar": [character for foo in batch["foo"] for character in foo]}

dataset = dataset.map(augment, batched=True, remove_columns=dataset.column_names)
print(dataset["bar"])  # now it has 9 rows
# ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c']

Note that I had to pass remove_columns=dataset.column_names to drop the old column “foo”: it still has 3 rows, so it cannot sit alongside the new 9-row column “bar”.
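
If you want to keep the other columns rather than drop them, one option is to repeat their values inside the batched function so that every returned column has the same length. Here is a minimal sketch of that idea. The function name `explode_batch` and the columns `id`, `source_id` and `char` are made up for illustration, and the function is run on a plain dict so it works without the library installed:

```python
def explode_batch(batch):
    """Return one output row per character of "foo".

    The "id" of the originating row is repeated under a new name,
    "source_id", so that both output columns have the same length.
    """
    out = {"source_id": [], "char": []}
    for row_id, foo in zip(batch["id"], batch["foo"]):
        for character in foo:
            out["source_id"].append(row_id)
            out["char"].append(character)
    return out


# In batched mode a batch is a dict mapping column names to lists of values.
batch = {"id": [0, 1], "foo": ["ab", "cde"]}
result = explode_batch(batch)
print(result)
# {'source_id': [0, 0, 1, 1, 1], 'char': ['a', 'b', 'c', 'd', 'e']}
```

With a real dataset that has "id" and "foo" columns, this should plug into the same call as above: dataset.map(explode_batch, batched=True, remove_columns=dataset.column_names).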
