Dataset map method - how to pass argument to the function

sssingh · March 30, 2022, 6:36pm

Hi, I just started using the Hugging Face library. I am wondering how I can pass the model and tokenizer to my processing function along with the batch when using the map method.

def my_processing_func(batch, model, tokenizer):
–code–

I am using map like this…
new_dataset = my_dataset.map(my_processing_func, model, tokenizer, batched=True)

When I do this, it does not fail, but instead of passing the dictionary with input_ids and attention_mask, it passes a list of just input_ids as the batch to my_processing_func. When I remove the model and tokenizer arguments, it sends the dictionary as expected.

Where am I going wrong?

Thanks in advance.

mariosasko · March 31, 2022, 11:24am

Hi! You can use fn_kwargs to pass extra arguments through map to your function:

new_dataset = my_dataset.map(my_processing_func, batched=True, fn_kwargs={"model": model, "tokenizer": tokenizer})

Or you can use partial:

from functools import partial
new_dataset = my_dataset.map(partial(my_processing_func, model=model, tokenizer=tokenizer), batched=True)
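To see why the two forms are equivalent, here is a minimal sketch that does not need the datasets library. The apply_batched helper is a hypothetical stand-in for the way map forwards fn_kwargs to the function, not the library's actual implementation, and the body of my_processing_func, the dummy model and tokenizer strings, and the sample batch are placeholders for illustration only.

```python
from functools import partial


def apply_batched(function, batch, fn_kwargs=None):
    """Hypothetical stand-in: call `function` on one batch, forwarding
    any extra keyword arguments, as map is assumed to do with fn_kwargs."""
    return function(batch, **(fn_kwargs or {}))


def my_processing_func(batch, model, tokenizer):
    # Placeholder body: report what the function actually received.
    return {
        "n_rows": len(batch["input_ids"]),
        "model": model,
        "tokenizer": tokenizer,
    }


# Dummy inputs; a real batch would come from the dataset.
batch = {"input_ids": [[1, 2], [3]], "attention_mask": [[1, 1], [1]]}
model, tokenizer = "dummy-model", "dummy-tokenizer"

# Option 1: extra arguments travel separately as keyword arguments.
via_kwargs = apply_batched(
    my_processing_func,
    batch,
    fn_kwargs={"model": model, "tokenizer": tokenizer},
)

# Option 2: extra arguments are bound up front, leaving a one-argument callable.
via_partial = apply_batched(
    partial(my_processing_func, model=model, tokenizer=tokenizer),
    batch,
)
```

In both cases the function receives the full batch dictionary as its first argument, and model and tokenizer arrive as keyword arguments.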

BramVanroy · March 31, 2022, 11:51am

Is there any downside to using either option? If I remember correctly, lambdas are not picklable. So my assumption would be that if you do something like

new_dataset = my_dataset.map(lambda batch: my_processing_func(batch, model, tokenizer), batched=True)

it won’t be cached. Is that correct?

sssingh · March 31, 2022, 1:05pm

Super!! this works for me … thanks

mariosasko · April 5, 2022, 12:16pm

There shouldn’t be a significant difference in speed between these two approaches.
We use dill, which knows how to pickle lambdas in most situations.
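The difference between the standard pickle module and dill can be checked directly. The sketch below is only an illustration: the helper names stdlib_can_pickle and dill_roundtrips and the add_one lambda are made up for this example, and dill is treated as optional, so the second check returns None when it is not installed.

```python
import pickle


def stdlib_can_pickle(obj):
    """Return True if the standard-library pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def dill_roundtrips(fn, sample, expected):
    """Serialize fn with dill, load it back, and compare its output.

    Returns None if dill is not installed, False if serialization fails.
    """
    try:
        import dill
    except ImportError:
        return None
    try:
        restored = dill.loads(dill.dumps(fn))
    except Exception:
        return False
    return restored(sample) == expected


# A lambda has no importable name, so the standard pickle cannot
# serialize it by reference.
add_one = lambda batch: [x + 1 for x in batch]

lambda_ok_with_pickle = stdlib_can_pickle(add_one)
lambda_ok_with_dill = dill_roundtrips(add_one, [1, 2], [2, 3])
```

The standard pickle rejects the lambda, which is the behaviour the question above had in mind; dill serializes functions by value when it cannot find them by name, which is why a lambda passed to map can still be handled in most situations.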
