Hi, I just started using the Hugging Face library. I am wondering how I can pass the model and tokenizer to my processing function, along with the batch, when using the map method.
def my_processing_func(batch, model, tokenizer):
–code–
I am using map like this…
new_dataset = my_dataset.map(my_processing_func, model, tokenizer, batched=True)
When I do this it does not fail, but instead of passing the dictionary with input_ids and attention_mask, it passes a list of just input_ids as the batch to my_processing_func. When I remove the model and tokenizer arguments, it sends the dictionary as expected.
Where am I going wrong?
Thanks in advance.
Hi! You can use fn_kwargs
to pass the arguments to the map function:
new_dataset = my_dataset.map(my_processing_func, batched=True, fn_kwargs={"model": model, "tokenizer": tokenizer})
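To illustrate what `fn_kwargs` does, here is a toy stand-in for `map` (a sketch, not the real `datasets.Dataset.map` implementation) showing how the extra keyword arguments are forwarded to the processing function while the batch stays a dict of columns:

```python
# Toy sketch of how fn_kwargs reaches the processing function.
# This is NOT the real datasets implementation, just the forwarding idea.
def toy_map(batch, fn, fn_kwargs=None):
    fn_kwargs = fn_kwargs or {}
    return fn(batch, **fn_kwargs)

def my_processing_func(batch, model=None, tokenizer=None):
    # batch is still the full dict of columns, as expected
    return {"n_examples": [len(batch["input_ids"])]}

batch = {"input_ids": [[1, 2], [3, 4]], "attention_mask": [[1, 1], [1, 1]]}
result = toy_map(batch, my_processing_func,
                 fn_kwargs={"model": "m", "tokenizer": "t"})
print(result)  # {'n_examples': [2]}
```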
Or you can use partial:
from functools import partial
new_dataset = my_dataset.map(partial(my_processing_func, model=model, tokenizer=tokenizer), batched=True)
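The `partial` route works because it pre-binds the keyword arguments, so `map` only ever sees a one-argument function. A minimal stdlib-only illustration (toy function and values, not the real model/tokenizer objects):

```python
from functools import partial

def my_processing_func(batch, model, tokenizer):
    # tag each item with the bound model/tokenizer names
    return {"tagged": [f"{model}:{tokenizer}:{x}" for x in batch]}

# pre-bind model and tokenizer; the result is callable with just the batch
bound = partial(my_processing_func, model="bert", tokenizer="wp")
print(bound([1, 2]))  # {'tagged': ['bert:wp:1', 'bert:wp:2']}
```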
Is there any downside to using either option? If I remember correctly, lambdas are not picklable. So my assumption would be that if you do something like
new_dataset = my_dataset.map(lambda batch: my_processing_func(batch, model, tokenizer), batched=True)
it won’t be cached. Is that correct?
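For what it's worth, the standard-library `pickle` module does reject lambdas, while a `partial` over a named module-level function pickles by reference. (Note this is only a stdlib demonstration: `datasets` hashes functions with `dill`, which handles more cases than plain `pickle`, so actual caching behavior may differ.)

```python
import pickle
from functools import partial

def my_processing_func(batch, model=None, tokenizer=None):
    return batch

# A partial over a named, module-level function pickles by reference.
try:
    pickle.dumps(partial(my_processing_func, model="m", tokenizer="t"))
    partial_picklable = True
except Exception:
    partial_picklable = False

# A lambda cannot be serialized by the stdlib pickle module.
try:
    pickle.dumps(lambda batch: my_processing_func(batch, "m", "t"))
    lambda_picklable = True
except Exception:
    lambda_picklable = False

print(partial_picklable, lambda_picklable)
```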
Super!! This works for me … thanks!