Dealing with large objects as arguments in datasets.map

Hi, I’ve recently noticed that fingerprinting takes a lot of time when function arguments are large, e.g.

from datasets import load_dataset
# Random dataset
dataset = load_dataset('glue', 'mrpc', split='test')

def do_nothing(examples, important_dict):
    return examples

# A very big dictionary
important_dict = {}
for i in range(10**8):
    important_dict[i] = 'nothing'

dataset = dataset.map(
    lambda examples: do_nothing(examples, important_dict),
    batched=True,
    load_from_cache_file=False,
    num_proc=10, # The larger the slower
    keep_in_memory=True) 

Fingerprinting of the do_nothing function takes a lot of time even with a single process, and with num_proc workers it is basically N times more (which can amount to days when the arguments of the do_nothing function are really big).

It would be great if it were possible to disable fingerprinting altogether, to exclude certain function arguments from being fingerprinted (like the important_dict), or, if not exclude, then at least to mark them as immutable so that their fingerprint is calculated only once. Or am I doing something completely wrong and there is a better way?
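One possible workaround for the "mark it as immutable" idea, sketched with plain pickle from the standard library: since the fingerprint is derived from a dill/pickle-style serialization, a wrapper class whose `__reduce__` emits only a small caller-chosen token serializes (and therefore hashes) in constant size, regardless of how big the wrapped object is. This is only a sketch under that assumption; `BigObjectWrapper` and `_rebuild_from_token` are hypothetical names, and note that with `num_proc > 1` the workers would receive the token rather than the real object, so they would have to reload it themselves.

```python
import pickle

def _rebuild_from_token(token):
    # Placeholder: real code would reload the big object from the
    # token (a file path, a cache key, ...). Here we only keep the token.
    wrapper = BigObjectWrapper.__new__(BigObjectWrapper)
    wrapper.obj = None
    wrapper.token = token
    return wrapper

class BigObjectWrapper:
    """Make pickling (and any pickle/dill-based hashing) see only a
    tiny fixed token instead of the wrapped object itself."""

    def __init__(self, obj, token):
        self.obj = obj      # the big object, used normally in-process
        self.token = token  # caller-chosen identifier, e.g. a version string

    def __reduce__(self):
        # Serialize only the token, never the wrapped object.
        return (_rebuild_from_token, (self.token,))

small = BigObjectWrapper({0: 'nothing'}, token='important_dict-v1')
big = BigObjectWrapper({i: 'nothing' for i in range(10**5)},
                       token='important_dict-v1')

# The pickled size is independent of the wrapped object's size.
print(pickle.dumps(small) == pickle.dumps(big))  # True
```

The trade-off is that correctness of the cache now rests on you bumping the token whenever the big object actually changes.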

System info:

  • Ubuntu
  • Python 3.8
  • datasets==1.14.0

Hi ! Computing the fingerprint of the mapped dataset is necessary for the caching mechanism to work.
So you can disable this with set_caching_enabled(False), but then every time you re-run your code it will recompute the map call.

The fingerprint is computed by hashing the code and the variables of your map function, so it takes time because it hashes your big dictionary. It could be helpful if you could specify the hash of your dictionary in advance, or maybe the fingerprint of the resulting dataset. What do you think ?
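For the "hash in advance" idea, the expensive part only needs to happen once: hash the serialized object a single time up front and reuse the digest across map calls and workers. A minimal stdlib sketch (the helper name `precompute_hash` is made up here; it mimics the idea rather than the actual dill-based `Hasher` that `datasets` uses, and `Dataset.map` also accepts a `new_fingerprint` argument if you want to supply the resulting dataset's fingerprint yourself):

```python
import hashlib
import pickle

def precompute_hash(obj):
    # Hash the serialized object once, up front, and reuse the digest
    # instead of letting every map call (or every worker) re-hash it.
    return hashlib.md5(pickle.dumps(obj)).hexdigest()

important_dict = {i: 'nothing' for i in range(10**5)}
dict_hash = precompute_hash(important_dict)  # pay the cost exactly once
```

The digest is deterministic for the same object in the same process, so it can safely stand in for the object wherever only its hash is needed.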

Thank you for the explanation. I get what set_caching_enabled does, but it does not disable fingerprinting of .map arguments.

Regarding providing hashes in advance, this is probably the easiest option. I've tested a quick workaround where update_fingerprint receives an extra argument precalculated_hashes - a dictionary[name, hash] - that I pass through the .map call. In my case, I provided a hash for the whole function (i.e. do_nothing from the example). Does this seem right?
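To make the proposal concrete, here is a hypothetical sketch of what such an extended update_fingerprint could look like (the real `datasets.fingerprint.update_fingerprint` uses a dill-based Hasher and a different signature; this toy version only illustrates the lookup-before-hashing idea): arguments whose name appears in `precalculated_hashes` contribute their precomputed digest, and are never serialized or hashed themselves.

```python
import hashlib

def update_fingerprint(fingerprint, transform_name, transform_args,
                       precalculated_hashes=None):
    # Toy model of the proposed extension: fold the previous fingerprint,
    # the transform name, and each argument into one digest, but let a
    # precomputed hash stand in for any argument listed by name.
    precalculated_hashes = precalculated_hashes or {}
    h = hashlib.md5(fingerprint.encode())
    h.update(transform_name.encode())
    for name in sorted(transform_args):
        if name in precalculated_hashes:
            h.update(precalculated_hashes[name].encode())  # skip the big value
        else:
            h.update(repr(transform_args[name]).encode())
    return h.hexdigest()

# Two different big dicts with the same declared hash yield the same
# fingerprint, and the big values are never touched by the hasher.
fp_a = update_fingerprint('abc', 'map', {'important_dict': {1: 'x'}},
                          precalculated_hashes={'important_dict': 'deadbeef'})
fp_b = update_fingerprint('abc', 'map', {'important_dict': {2: 'y'}},
                          precalculated_hashes={'important_dict': 'deadbeef'})
print(fp_a == fp_b)  # True
```

As with any user-supplied hash, the cache is only as correct as the digests you declare.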

Independent of this, it would still be great if set_caching_enabled also disabled fingerprinting - unless it is used for something else too.