Making multiple samples from single samples using HuggingFace Datasets

My use case involves building multiple samples from a single sample. Is there any way I can do that with Dataset.map()?

Just a view of what I need to do:

# this is what my dataset looks like
dataset = [(1, 2, 3), (5, 7, 8)]

# this is what my output dataset should look like
output = [1, 2, 3, 5, 7, 8]

** These lists and tuples are just to make my point clearer; in reality it's a Hugging Face dataset 😉.

Can I do something like that with Dataset.map(), or is there another way to do this?

cc @lhoestq

Hi! Yes, you can definitely do that using a batched map call.
Basically, your function can return more examples than it received in the batch:

from datasets import Dataset

d = Dataset.from_dict({"foo": [(1, 2, 3), (5, 7, 8)]})
d = d.map(lambda x: {"foo": [i for row in x["foo"] for i in row]}, batched=True)

print(d["foo"])
# [1, 2, 3, 5, 7, 8]

Wow!! Thank you so much.

Hi @lhoestq ,

Thanks for the insight. I have been rethinking this solution and wondering: is there any way to do the same job without using "batched=True"?

Thanks!

Hi @oliversn , to apply a function that makes several examples out of one, you still have to use batched=True because the output is a batch. Note that you can set the batch_size to 1 if you only want to give your function a batch of one example at a time.
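As a rough sketch of the batch_size=1 idea: the function still receives a batch (a dict mapping column names to lists), but each list holds exactly one entry. The name explode_one is made up for illustration, and plain dicts stand in here for the batches that map would pass in.

```python
def explode_one(batch):
    # With batched=True and batch_size=1, each column is a list with one entry.
    (row,) = batch["foo"]
    # Return a batch containing one output example per element of the input row.
    return {"foo": list(row)}


# Plain dicts standing in for the single-example batches map() would supply.
batches = [{"foo": [[1, 2, 3]]}, {"foo": [[5, 7, 8]]}]
flattened = [value for b in batches for value in explode_one(b)["foo"]]
print(flattened)  # [1, 2, 3, 5, 7, 8]
```

On a real dataset, this function would be passed as d.map(explode_one, batched=True, batch_size=1).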

Much appreciated @lhoestq! That makes total sense. 🙂

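One important note: if you're doing this, you can't simply map to a new column name, because the original column would keep its old length and no longer line up with the new one. You must either overwrite the original column or drop it with remove_columns, as in: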

from datasets import Dataset

d = Dataset.from_dict({"foo": [(1, 2, 3), (5, 7, 8)]})
d = d.map(lambda x: {"bar": [i for row in x["foo"] for i in row]}, batched=True, remove_columns="foo")

print(d["bar"])
# [1, 2, 3, 5, 7, 8]
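
If the dataset has other columns you want to keep, one possible approach is to have the function return those columns too, repeating each value once per new example so that all columns end up the same length. This is only a sketch: the "label" column and the name explode_with_label are hypothetical, and a plain dict stands in for the batch that map would pass in.

```python
def explode_with_label(batch):
    # Hypothetical columns: "foo" holds sequences, "label" holds one value per row.
    out = {"bar": [], "label": []}
    for row, label in zip(batch["foo"], batch["label"]):
        for value in row:
            out["bar"].append(value)
            # Repeat the row's label for every example created from that row.
            out["label"].append(label)
    return out


# A plain dict standing in for one batch of two rows.
batch = {"foo": [[1, 2, 3], [5, 7]], "label": ["a", "b"]}
result = explode_with_label(batch)
print(result)  # {'bar': [1, 2, 3, 5, 7], 'label': ['a', 'a', 'a', 'b', 'b']}
```

On a real dataset, this would be called along the lines of d.map(explode_with_label, batched=True, remove_columns=d.column_names), so that the original columns are dropped and only the returned ones remain.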