I have a function (let's call it `f`) that runs an LLM on strings from the IMDB movie reviews dataset. However, when I run `imdb.map(f)`, it grinds to a halt when I am using a large model. I think the problem is that `map` is trying to hash `f`, which references a large model in its body, and this hashing is very expensive. I found that setting `new_fingerprint='foo', load_from_cache_file=False` fixes this, but my understanding is that this still creates a cache file that will never be used.
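For reference, here is a minimal sketch of what I'm doing (the model name and the pipeline call are placeholders, not my actual setup):

```python
from datasets import load_dataset
from transformers import pipeline

imdb = load_dataset("imdb", split="test")

# Placeholder for the large model; my real setup uses a different model.
llm = pipeline("text-generation", model="some-large-model")

def f(example):
    # f closes over the large model object, which is what map() ends up hashing
    example["output"] = llm(example["text"])[0]["generated_text"]
    return example

# The workaround that avoids the slow hashing, but (I think) still writes an
# unused cache file:
imdb_out = imdb.map(f, new_fingerprint="foo", load_from_cache_file=False)
```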
I've tried calling `datasets.disable_caching()` before running `map`, but it still appears to be trying to hash `f`.
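That attempt looks roughly like this:

```python
import datasets

datasets.disable_caching()

# Even with caching disabled, map() still seems to spend a long time hashing f
imdb_out = imdb.map(f)
```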
What is the right way to do this? I would have thought that mapping LLMs over datasets is a common use case, and I was surprised that this (obscure?) hashing problem doesn't have an obvious workaround (unless I'm missing something).
Thanks