How to use `map` or similar when one row is mapped to multiple rows?

Suppose I have a dataset with 100 rows and a function that turns each row into 10 rows. What I want is a mapped dataset with 1000 rows. Is there a way to do this with the package? Currently I get a length mismatch error when using map:

ArrowInvalid: Column 1 named test_col expected length 100 but got length 1000

Hi! You can use map in batched mode to return more rows than the input:

from datasets import Dataset

dataset = Dataset.from_dict({"foo": ["aaa", "bbb", "ccc"]})  # 3 rows

def augment(batch):
    # Each 3-character string becomes 3 rows, one per character
    return {"bar": [character for foo in batch["foo"] for character in foo]}

dataset = dataset.map(augment, batched=True, remove_columns=dataset.column_names)
print(dataset["bar"])  # now it has 9 rows
# ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c']

Note that I had to pass remove_columns=dataset.column_names in order to drop the old column "foo", which doesn't have 9 rows.
