Suppose I have a dataset with 100 rows and a function
that turns each row into 10 rows. What I want is a mapped dataset with 1000 rows. Is there a way to do this using the package? Currently I get a length mismatch error when using map:
ArrowInvalid: Column 1 named test_col expected length 100 but got length 1000
Hi! You can use map
in batched mode to return more rows than the input:
from datasets import Dataset

dataset = Dataset.from_dict({"foo": ["aaa", "bbb", "ccc"]})  # 3 rows

def augment(batch):
    # flatten each string into its characters, one output row per character
    return {"bar": [character for foo in batch["foo"] for character in foo]}

dataset = dataset.map(augment, batched=True, remove_columns=dataset.column_names)
print(dataset["bar"])  # now it has 9 rows
# ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c']
Note that I had to pass remove_columns=dataset.column_names
in order to drop the old column “foo”, which doesn’t have 9 rows.