My use case involves building multiple samples from a single sample. Is there any way I can do that with Dataset.map()?
Here is a sketch of what I need to do:
# this is what my dataset looks like
dataset = [(1, 2, 3), (5, 7, 8)]
# this is what my output dataset should look like
output = [1, 2, 3, 5, 7, 8]
** These lists and tuples are just to make my point clearer; in reality it's a Hugging Face dataset.
Can I do something like that with Dataset.map(), or is there another way to do this?
cc @lhoestq
Hi! Yes, you can definitely do that using a batched map call.
Basically, you can return more examples than the number of examples in the input batch:
from datasets import Dataset
d = Dataset.from_dict({"foo": [(1, 2, 3), (5, 7, 8)]})
d = d.map(lambda x: {"foo": [i for row in x["foo"] for i in row]}, batched=True)
print(d["foo"])
# [1, 2, 3, 5, 7, 8]
Hi @lhoestq ,
Thanks for the insight. I have been rethinking this solution and wondering whether there is any way to do the same job without using "batched=True"?
Thanks!
Hi @oliversn, to apply a function that makes several examples out of one, you still have to use batched=True because the output is a batch. Note that you can set batch_size to 1 if you only want to give your function a batch of one example at a time.
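As a rough illustration of the batch_size=1 case, here is a minimal sketch of what such a function could look like. The name explode_batch is made up for this example, and the plain dicts below only stand in for the dict-like batch that map passes in; the map call itself is left as a comment.

```python
def explode_batch(batch):
    """Turn each row of batch["foo"] into several single-value examples.

    With batched=True, `batch` maps column names to lists, and the
    returned lists may be longer (or shorter) than the input ones.
    """
    return {"foo": [value for row in batch["foo"] for value in row]}


# With batch_size=1, each call would receive a batch holding one example:
print(explode_batch({"foo": [[1, 2, 3]]}))  # {'foo': [1, 2, 3]}
print(explode_batch({"foo": [[5, 7, 8]]}))  # {'foo': [5, 7, 8]}

# Assumed usage on the dataset from the earlier reply:
# d = d.map(explode_batch, batched=True, batch_size=1)
```

The function is the same whether it receives one example or many; batch_size only changes how many rows arrive per call.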
Much appreciated @lhoestq! That makes total sense.
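One important note: if you're doing this, you can't map to a new column name while keeping the original column, because the new column would have a different number of rows than the old one. You must overwrite the original column name or use remove_columns, as in: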
from datasets import Dataset
d = Dataset.from_dict({"foo": [(1, 2, 3), (5, 7, 8)]})
d = d.map(lambda x: {"bar": [i for row in x["foo"] for i in row]}, batched=True, remove_columns="foo")
print(d["bar"])
# [1, 2, 3, 5, 7, 8]
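If the dataset has other columns you want to keep, one option is to repeat their values so that every returned column has the same length. The sketch below assumes a hypothetical "id" column that is not in the original example, and uses a plain dict in place of a real batch.

```python
def explode_with_ids(batch):
    """Explode `foo` and repeat each row's `id` so all columns stay aligned."""
    out = {"id": [], "foo": []}
    for row_id, row in zip(batch["id"], batch["foo"]):
        for value in row:
            out["id"].append(row_id)  # one copy of the id per exploded value
            out["foo"].append(value)
    return out


# "id" is a hypothetical extra column used only for illustration.
batch = {"id": ["a", "b"], "foo": [[1, 2, 3], [5, 7, 8]]}
print(explode_with_ids(batch))
# {'id': ['a', 'a', 'a', 'b', 'b', 'b'], 'foo': [1, 2, 3, 5, 7, 8]}
```

Because every column is returned with matching length, a function like this should not need remove_columns when passed to map with batched=True.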