Dealing with large objects as arguments in datasets.map

Hi, I’ve recently noticed that fingerprinting takes a lot of time when function arguments are large, e.g.

from datasets import load_dataset
# Random dataset
dataset = load_dataset('glue', 'mrpc', split='test')

def do_nothing(examples, important_dict):
    return examples

# A very big dictionary
important_dict = {}
for i in range(10**8):
    important_dict[i] = 'nothing'

dataset = dataset.map(
    lambda examples: do_nothing(examples, important_dict),
    batched=True,
    load_from_cache_file=False,
    num_proc=10, # The larger the slower
    keep_in_memory=True) 

Fingerprinting of the do_nothing function takes a lot of time even with a single process, and with num_proc workers it is basically N times more (which can amount to days when the arguments of the do_nothing function are really big).

It would be great if it were possible to disable fingerprinting altogether, to exclude certain function arguments from being fingerprinted (like the important_dict), or, if not exclude, then at least to mark them as immutable so that their fingerprint is calculated only once. Or am I doing something completely wrong and there is a better way?
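One possible workaround for the "mark it as immutable" idea, sketched with plain pickle from the standard library: since the fingerprint is derived from a dill/pickle-style serialization, a wrapper class whose `__reduce__` emits only a small caller-chosen token serializes (and therefore hashes) in constant size, regardless of how big the wrapped object is. This is only a sketch under that assumption; `BigObjectWrapper` and `_rebuild_from_token` are hypothetical names, and note that with `num_proc > 1` the workers would receive the token rather than the real object, so they would have to reload it themselves.

```python
import pickle

def _rebuild_from_token(token):
    # Placeholder: real code would reload the big object from the
    # token (a file path, a cache key, ...). Here we only keep the token.
    wrapper = BigObjectWrapper.__new__(BigObjectWrapper)
    wrapper.obj = None
    wrapper.token = token
    return wrapper

class BigObjectWrapper:
    """Make pickling (and any pickle/dill-based hashing) see only a
    tiny fixed token instead of the wrapped object itself."""

    def __init__(self, obj, token):
        self.obj = obj      # the big object, used normally in-process
        self.token = token  # caller-chosen identifier, e.g. a version string

    def __reduce__(self):
        # Serialize only the token, never the wrapped object.
        return (_rebuild_from_token, (self.token,))

small = BigObjectWrapper({0: 'nothing'}, token='important_dict-v1')
big = BigObjectWrapper({i: 'nothing' for i in range(10**5)},
                       token='important_dict-v1')

# The pickled size is independent of the wrapped object's size.
print(pickle.dumps(small) == pickle.dumps(big))  # True
```

The trade-off is that correctness of the cache now rests on you bumping the token whenever the big object actually changes.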

System info:

  • Ubuntu
  • Python 3.8
  • datasets==1.14.0

Hi ! Computing the fingerprint of the mapped dataset is necessary for the caching mechanism to work.
So you can disable this with set_caching_enabled(False), but then every time you re-run your code it will recompute the map call.

The fingerprint is computed by hashing the code and the variables of your map function, so it takes time because it hashes your big dictionary. It could be helpful if you could specify the hash of your dictionary in advance, or maybe the fingerprint of the resulting dataset. What do you think ?
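For the "hash in advance" idea, the expensive part only needs to happen once: hash the serialized object a single time up front and reuse the digest across map calls and workers. A minimal stdlib sketch (the helper name `precompute_hash` is made up here; it mimics the idea rather than the actual dill-based `Hasher` that `datasets` uses, and `Dataset.map` also accepts a `new_fingerprint` argument if you want to supply the resulting dataset's fingerprint yourself):

```python
import hashlib
import pickle

def precompute_hash(obj):
    # Hash the serialized object once, up front, and reuse the digest
    # instead of letting every map call (or every worker) re-hash it.
    return hashlib.md5(pickle.dumps(obj)).hexdigest()

important_dict = {i: 'nothing' for i in range(10**5)}
dict_hash = precompute_hash(important_dict)  # pay the cost exactly once
```

The digest is deterministic for the same object in the same process, so it can safely stand in for the object wherever only its hash is needed.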

Thank you for the explanation. I get what set_caching_enabled does, but it does not disable fingerprinting of .map arguments.

Regarding providing hashes in advance, this is probably the easiest option. I've tested a quick workaround where update_fingerprint receives an extra argument precalculated_hashes - a dictionary[name, hash] - that I pass through the .map call. In my case, I provided a hash for the whole function (i.e. do_nothing from the example). Does this seem right?
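To make the proposal concrete, here is a hypothetical sketch of what such an extended update_fingerprint could look like (the real `datasets.fingerprint.update_fingerprint` uses a dill-based Hasher and a different signature; this toy version only illustrates the lookup-before-hashing idea): arguments whose name appears in `precalculated_hashes` contribute their precomputed digest, and are never serialized or hashed themselves.

```python
import hashlib

def update_fingerprint(fingerprint, transform_name, transform_args,
                       precalculated_hashes=None):
    # Toy model of the proposed extension: fold the previous fingerprint,
    # the transform name, and each argument into one digest, but let a
    # precomputed hash stand in for any argument listed by name.
    precalculated_hashes = precalculated_hashes or {}
    h = hashlib.md5(fingerprint.encode())
    h.update(transform_name.encode())
    for name in sorted(transform_args):
        if name in precalculated_hashes:
            h.update(precalculated_hashes[name].encode())  # skip the big value
        else:
            h.update(repr(transform_args[name]).encode())
    return h.hexdigest()

# Two different big dicts with the same declared hash yield the same
# fingerprint, and the big values are never touched by the hasher.
fp_a = update_fingerprint('abc', 'map', {'important_dict': {1: 'x'}},
                          precalculated_hashes={'important_dict': 'deadbeef'})
fp_b = update_fingerprint('abc', 'map', {'important_dict': {2: 'y'}},
                          precalculated_hashes={'important_dict': 'deadbeef'})
print(fp_a == fp_b)  # True
```

As with any user-supplied hash, the cache is only as correct as the digests you declare.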

Independent of this, it would still be great if set_caching_enabled also disabled fingerprinting - unless it is used for something else too.