Hi! You can use map in batched mode to return more rows than the input:
from datasets import Dataset
dataset = Dataset.from_dict({"foo": ["aaa", "bbb", "ccc"]}) # 3 rows
def augment(batch):
    # Flatten every string in the batch into one list of characters
    return {"bar": [character for foo in batch["foo"] for character in foo]}
dataset = dataset.map(augment, batched=True, remove_columns=dataset.column_names)
print(dataset["bar"]) # now it has 9 rows
# ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c']
Note that I had to pass remove_columns=dataset.column_names in order to drop the old column “foo”: it still has only 3 values per batch, so it can't sit alongside the new 9-row column “bar”.