Making multiple samples from single samples using HuggingFace Datasets

My use case involves building multiple samples from a single sample. Is there any way I can do that with Datasets.map()?

Just a view of what I need to do:

# this is what my dataset looks like
dataset = [(1, 2, 3), (5, 7, 8)]

# this is what my output dataset should look like
output = [1, 2, 3, 5, 7, 8]

** The list and tuples are just to make my point clearer; in practice it's a Hugging Face dataset 😉.

Can I do something like that with Datasets.map(), or is there another way to do this?

cc @lhoestq

Hi! Yes, you can definitely do that using a batched map call.
With batched=True, the function can return more examples than it received in the batch:

from datasets import Dataset

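# Each row of "foo" holds a sequence of values
d = Dataset.from_dict({"foo": [(1, 2, 3), (5, 7, 8)]})
# With batched=True, x["foo"] is a list of rows; flatten it into one value per example
d = d.map(lambda x: {"foo": [i for row in x["foo"] for i in row]}, batched=True)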

print(d["foo"])
# [1, 2, 3, 5, 7, 8]

Wow! Thank you so much.