Hi, I’ve recently noticed that fingerprinting takes a lot of time when function arguments are large, e.g.
```python
import datasets
from datasets import load_dataset

# Random dataset
dataset = load_dataset('glue', 'mrpc', split='test')

def do_nothing(examples, important_dict):
    return examples

# A very big dictionary
important_dict = {}
for i in range(10**8):
    important_dict[i] = 'nothing'

dataset = dataset.map(
    lambda examples: do_nothing(examples, important_dict),
    batched=True,
    load_from_cache_file=False,
    num_proc=10,  # the larger, the slower
    keep_in_memory=True)
```
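Almost all of that time seems to be spent hashing the closure that captures `important_dict`. This can be checked directly with the hasher that `datasets` appears to use internally (my assumption here is that `datasets.fingerprint.Hasher` is what `map` relies on; that's my reading of the 1.14 source, not documented API):

```python
import time
from datasets.fingerprint import Hasher

# Hashing the bare function is fast; hashing a closure that captures
# important_dict forces the whole dictionary to be serialized, which is
# the slow part.
start = time.time()
Hasher.hash(do_nothing)
print(f"bare function: {time.time() - start:.3f}s")

start = time.time()
Hasher.hash(lambda examples: do_nothing(examples, important_dict))
print(f"closure over important_dict: {time.time() - start:.3f}s")
```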
Fingerprinting of the `do_nothing` function takes a lot of time even when only one process is used, and with `num_proc=N` it is basically N times slower (which can add up to days when the arguments of `do_nothing` are really big).
It would be great if it were possible to disable fingerprinting altogether, to exclude certain function arguments from being fingerprinted (like `important_dict`), or, failing that, to mark them as immutable so that their fingerprint is computed only once. Or am I doing something completely wrong and is there a better way?
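The only workaround I can think of so far is to pass a precomputed `new_fingerprint` to `map`; if I'm reading `datasets.fingerprint.fingerprint_transform` correctly, the expensive hashing only happens when `new_fingerprint` is `None`. A minimal sketch of what I mean (this is my assumption about the internals, not documented behavior, and I'm not sure how it interacts with `num_proc`, so I leave that out here):

```python
import uuid

# Assumption: when new_fingerprint is supplied, map() skips hashing the
# transform and its closure entirely (based on reading fingerprint.py).
# The downside: caching can no longer detect changes to the function.
dataset = dataset.map(
    lambda examples: do_nothing(examples, important_dict),
    batched=True,
    load_from_cache_file=False,
    keep_in_memory=True,
    new_fingerprint=str(uuid.uuid4()),  # any unique string works
)
```

But this throws away cache correctness, so a built-in way to exclude specific arguments would still be much nicer.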
System info:
- Ubuntu
- Python 3.8
- datasets==1.14.0