Avoiding hashing in `map`

I have a function (let’s call it f) that involves running an LLM on strings from the IMDB movie reviews dataset. However, when I run

imdb.map(f)

it grinds to a halt when I am using a large model. I think the problem is that `map` is trying to hash `f`, which references a large model in its body, and this is very expensive. I found that setting `new_fingerprint='foo', load_from_cache_file=False` fixes this, but my understanding is that this still creates a cache file that will never be used.
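
Roughly, the pattern is the following (with load_model() standing in for however the LLM actually gets loaded):

from datasets import load_dataset

imdb = load_dataset("imdb", split="test")
model = load_model()  # placeholder for loading the large LLM

def f(example):
    # f closes over `model`, so map pickles the whole model when computing the fingerprint
    return {"output": model(example["text"])}

imdb = imdb.map(f)  # grinds to a halt here with a large model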

I’ve tried using `datasets.disable_caching()` before running `map`, but it still appears to be trying to hash `f`.

What is the right way to do this? I would have thought that mapping LLMs over datasets is a common use case, so I was surprised that this (obscure?) hashing problem doesn’t have an obvious workaround (unless I’m missing something).

Thanks

A few ways to work around this:

  1. Use a Simple Wrapper for the Function:
def f_wrapper(example):
    model = load_model()  # placeholder loader; loading here keeps the model out of the hashed closure (point 4 avoids reloading on every call)
    return {'output': model(example['text'])}  # map functions must return a dict of columns
  2. Use batched=True in map():
imdb = imdb.map(f_wrapper, batched=True, load_from_cache_file=False)  # f_wrapper then receives a batch (a dict of lists) rather than one example
  3. Avoid Using datasets.map() Entirely:
results = []
for example in imdb:
    result = f(example)
    results.append(result)
  4. Caching at the Function Level:
model = None

def f(example):
    global model
    if model is None:
        model = load_model()  # Load the model only once
    return {'output': model(example['text'])}  # return a dict of columns, as map expects
  5. Disable the hashing in map() (see the combined sketch after this list):
imdb = imdb.map(f, new_fingerprint='foo', load_from_cache_file=False)
  6. Parallelize with num_proc in map():
imdb = imdb.map(f_wrapper, batched=True, num_proc=4, load_from_cache_file=False)  # each worker process loads its own model inside f_wrapper
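
Putting 2, 4, and 5 together: load the model lazily inside the mapped function (so map never has to pickle the model when fingerprinting) and pass new_fingerprint so the function isn’t hashed at all. A minimal sketch, assuming load_model() is a placeholder for however the LLM gets loaded and "output" is just an example column name:

from datasets import load_dataset

imdb = load_dataset("imdb", split="test")

model = None  # loaded lazily, so it isn't in the closure when map fingerprints the function

def predict(batch):
    global model
    if model is None:
        model = load_model()  # placeholder: load the LLM once per process
    # with batched=True, batch["text"] is a list of strings
    return {"output": model(batch["text"])}

imdb = imdb.map(
    predict,
    batched=True,
    batch_size=32,
    new_fingerprint="imdb-llm-v1",  # any fixed string; skips hashing the function
    load_from_cache_file=False,     # don't try to reuse an existing cache file
)

If you also want num_proc > 1, keep the lazy loading: each worker will then load its own copy of the model instead of the parent process trying to pickle it across.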