How to deal with unpicklable objects in map

During the creation of my dataset I would like to add sent2vec representations of input sentences to the dataset. The code would look like this:

import sent2vec
from datasets import load_dataset

# sent2vec_path, train_f and valid_f are defined elsewhere
sent2vec_model = sent2vec.Sent2vecModel()
sent2vec_model.load_model(sent2vec_path, inference_mode=True)

datasets = load_dataset("text", data_files={"train": train_f, "validation": valid_f})

def preprocess(sentences):
    embedded_sents = sent2vec_model.embed_sentences(sentences["text"])
    return {"text": sentences["text"], "embeddings": embedded_sents}

# this call fails during fingerprinting (see traceback below)
datasets.map(preprocess, batch_size=None, batched=True)

Unfortunately this won’t work because the sent2vec model can’t be pickled (it seems), and the fingerprint generation thus fails. At first I thought the issue was that map uses multiprocessing by default, but using num_proc=1 does not help either. From the traceback it seems that the error arises during the fingerprint/hash update, when the sent2vec model is being pickled:

File "/mnt/c/dev/python/neural-fuzzy-repair/nfr/finetuning.py", line 48, in create_datasets
    datasets.map(preprocess, batch_size=None, batched=True)
  File "/home/bram/.local/share/virtualenvs/neural-fuzzy-repair-b49KnSNp/lib/python3.8/site-packages/datasets/dataset_dict.py", line 283, in map
    {
  File "/home/bram/.local/share/virtualenvs/neural-fuzzy-repair-b49KnSNp/lib/python3.8/site-packages/datasets/dataset_dict.py", line 284, in <dictcomp>
    k: dataset.map(
  File "/home/bram/.local/share/virtualenvs/neural-fuzzy-repair-b49KnSNp/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 1240, in map
    return self._map_single(
  File "/home/bram/.local/share/virtualenvs/neural-fuzzy-repair-b49KnSNp/lib/python3.8/site-packages/datasets/arrow_dataset.py", line 156, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/home/bram/.local/share/virtualenvs/neural-fuzzy-repair-b49KnSNp/lib/python3.8/site-packages/datasets/fingerprint.py", line 157, in wrapper
    kwargs[fingerprint_name] = update_fingerprint(
  File "/home/bram/.local/share/virtualenvs/neural-fuzzy-repair-b49KnSNp/lib/python3.8/site-packages/datasets/fingerprint.py", line 105, in update_fingerprint
    hasher.update(transform_args[key])
  File "/home/bram/.local/share/virtualenvs/neural-fuzzy-repair-b49KnSNp/lib/python3.8/site-packages/datasets/fingerprint.py", line 57, in update
    self.m.update(self.hash(value).encode("utf-8"))
  File "/home/bram/.local/share/virtualenvs/neural-fuzzy-repair-b49KnSNp/lib/python3.8/site-packages/datasets/fingerprint.py", line 53, in hash
    return cls.hash_default(value)
  File "/home/bram/.local/share/virtualenvs/neural-fuzzy-repair-b49KnSNp/lib/python3.8/site-packages/datasets/fingerprint.py", line 46, in hash_default
    return cls.hash_bytes(dumps(value))
  File "/home/bram/.local/share/virtualenvs/neural-fuzzy-repair-b49KnSNp/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 367, in dumps
    dump(obj, file)
  File "/home/bram/.local/share/virtualenvs/neural-fuzzy-repair-b49KnSNp/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 339, in dump
    Pickler(file, recurse=True).dump(obj)
  File "/home/bram/.local/share/virtualenvs/neural-fuzzy-repair-b49KnSNp/lib/python3.8/site-packages/dill/_dill.py", line 446, in dump
    StockPickler.dump(self, obj)
  File "/home/bram/.pyenv/versions/3.8.6/lib/python3.8/pickle.py", line 485, in dump
    self.save(obj)
  File "/home/bram/.pyenv/versions/3.8.6/lib/python3.8/pickle.py", line 558, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/bram/.local/share/virtualenvs/neural-fuzzy-repair-b49KnSNp/lib/python3.8/site-packages/dill/_dill.py", line 1435, in save_function
    pickler.save_reduce(_create_function, (obj.__code__,
  File "/home/bram/.pyenv/versions/3.8.6/lib/python3.8/pickle.py", line 690, in save_reduce
    save(args)
  File "/home/bram/.pyenv/versions/3.8.6/lib/python3.8/pickle.py", line 558, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/bram/.pyenv/versions/3.8.6/lib/python3.8/pickle.py", line 899, in save_tuple
    save(element)
  File "/home/bram/.pyenv/versions/3.8.6/lib/python3.8/pickle.py", line 558, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/bram/.pyenv/versions/3.8.6/lib/python3.8/pickle.py", line 884, in save_tuple
    save(element)
  File "/home/bram/.pyenv/versions/3.8.6/lib/python3.8/pickle.py", line 558, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/bram/.local/share/virtualenvs/neural-fuzzy-repair-b49KnSNp/lib/python3.8/site-packages/dill/_dill.py", line 1170, in save_cell
    pickler.save_reduce(_create_cell, (f,), obj=obj)
  File "/home/bram/.pyenv/versions/3.8.6/lib/python3.8/pickle.py", line 690, in save_reduce
    save(args)
  File "/home/bram/.pyenv/versions/3.8.6/lib/python3.8/pickle.py", line 558, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/bram/.pyenv/versions/3.8.6/lib/python3.8/pickle.py", line 884, in save_tuple
    save(element)
  File "/home/bram/.pyenv/versions/3.8.6/lib/python3.8/pickle.py", line 601, in save
    self.save_reduce(obj=obj, *rv)
  File "/home/bram/.pyenv/versions/3.8.6/lib/python3.8/pickle.py", line 715, in save_reduce
    save(state)
  File "/home/bram/.pyenv/versions/3.8.6/lib/python3.8/pickle.py", line 558, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/home/bram/.local/share/virtualenvs/neural-fuzzy-repair-b49KnSNp/lib/python3.8/site-packages/dill/_dill.py", line 933, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/home/bram/.pyenv/versions/3.8.6/lib/python3.8/pickle.py", line 969, in save_dict
    self._batch_setitems(obj.items())
  File "/home/bram/.pyenv/versions/3.8.6/lib/python3.8/pickle.py", line 995, in _batch_setitems
    save(v)
  File "/home/bram/.pyenv/versions/3.8.6/lib/python3.8/pickle.py", line 576, in save
    rv = reduce(self.proto)
  File "stringsource", line 2, in sent2vec.Sent2vecModel.__reduce_cython__
TypeError: no default __reduce__ due to non-trivial __cinit__

Is there any way around this? For instance by completely disabling the fingerprinting?

Oh interesting, thanks Bram.

Yes I guess in this case we would have to disable the fingerprinting, right @lhoestq?

Which is a bit of a shame, because in the future we would have liked to leverage the fingerprint to give users a super robust reproducibility setup, but :man_shrugging: Python (+ Cython in this case) will always be what it is (aka a huge open field).


It is unfortunate indeed, but AFAIK the only way for me to create the dataset is by disabling fingerprinting (but I am not sure how I can do that).

I do not think that this is a bug at all. It is normal that this error occurs given the way fingerprinting works, and I am not asking you to change it. But if there is a way for me to disable fingerprinting, then I can at least create the dataset and save it to disk manually. Down the line I can do something like:

from pathlib import Path

from datasets import Dataset

dataset_p = Path("path/to/dataset")
overwrite = False  # if overwrite, overwrite the existing dataset
if dataset_p.exists() and not overwrite:
    dataset = Dataset.load_from_disk(str(dataset_p))
else:
    dataset = ...  # create the dataset while fingerprinting is disabled
    dataset.save_to_disk(str(dataset_p))

In the short term, as a quick fix, you can just go through a Python dict if your dataset is not too big for your memory:

import sent2vec
from datasets import Dataset, DatasetDict, load_dataset

sent2vec_model = sent2vec.Sent2vecModel()
sent2vec_model.load_model(sent2vec_path, inference_mode=True)

datasets = load_dataset("text", data_files={"train": train_f, "validation": valid_f})

# Maybe this can be batched, I don't know about sent2vec_model
# (embed_sentences appears to expect a list, hence the single-element list per row)
temp_train_embeddings = [sent2vec_model.embed_sentences([sentence["text"]])[0] for sentence in datasets['train']]
temp_validation_embeddings = [sent2vec_model.embed_sentences([sentence["text"]])[0] for sentence in datasets['validation']]

preprocessed_train = Dataset.from_dict({"text": datasets['train']["text"], "embeddings": temp_train_embeddings})
preprocessed_validation = Dataset.from_dict({"text": datasets['validation']["text"], "embeddings": temp_validation_embeddings})

preprocessed_dataset = DatasetDict({'train': preprocessed_train, 'validation': preprocessed_validation})

But we should have something a lot simpler imo; this will be a frequent occurrence I think.

Every time you use map, the dataset’s fingerprint is updated using a function that takes as input:

  • the current fingerprint
  • a hash of the mapped function

The hash is computed by applying xxhash to the dill dump of the mapped function, so right now it’s not possible to compute the hash of unpicklable functions.
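
Conceptually, the update looks something like this (a minimal sketch of the idea, not the exact datasets internals):

import dill
import xxhash

def update_fingerprint(current_fingerprint, func):
    h = xxhash.xxh64()
    h.update(current_fingerprint.encode("utf-8"))
    # this dill dump is the step that fails for unpicklable functions
    h.update(dill.dumps(func))
    return h.hexdigest()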

Right now the best workaround I can think of is to make the function compatible with pickle, by replacing it with a class that implements __call__ plus custom __getstate__ and __setstate__ methods. We’ve already seen a similar case in this issue for example.
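
For your case, such a wrapper could look roughly like this (a sketch: it pickles only the model path and reloads the model on unpickling, so the cython object itself never goes through pickle):

import sent2vec

class Sent2vecPreprocessor:
    def __init__(self, model_path):
        self.model_path = model_path
        self._load_model()

    def _load_model(self):
        self.model = sent2vec.Sent2vecModel()
        self.model.load_model(self.model_path, inference_mode=True)

    def __call__(self, sentences):
        embedded_sents = self.model.embed_sentences(sentences["text"])
        return {"text": sentences["text"], "embeddings": embedded_sents}

    def __getstate__(self):
        # pickle only the path, never the unpicklable cython model
        return {"model_path": self.model_path}

    def __setstate__(self, state):
        self.model_path = state["model_path"]
        self._load_model()

You would then call datasets.map(Sent2vecPreprocessor(sent2vec_path), batch_size=None, batched=True) as before.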

However this is not very practical, so I’d like to have better support for unpicklable functions in the lib.

To make it possible to use map with unpicklable functions I see two possibilities. The first one would be to add a way to disable fingerprinting, but then we lose the caching features.
The second option, on the other hand, makes it possible to use unpicklable functions in map while keeping the caching features:

We could simply let users decorate their mapped function to sign it with information that is then used to compute the function’s hash, instead of pickling the function itself.

For example in your case we could imagine an API that works like this:

@signed_function(id="my_sent2vec_preprocessor_v1")
def preprocess(sentences):
    embedded_sents = sent2vec_model.embed_sentences(sentences["text"])
    return {"text": sentences["text"], "embeddings": embedded_sents}

Let me know what you think

As a first step, disabling fingerprinting seems a good approach. Users should be aware that when they explicitly do so (non-default behaviour), no automatic caching occurs.

I’m a bit confused about your annotation suggestion. So if I understand correctly, the current approach works as follows: every map (or filter) operation is hashed based on the dill dump of the whole function. You are suggesting to not hash the whole function, but to allow the user to assign a custom identifier to the mapping function. Instead of dill-dumping the whole function, you’d then just use the identifier in the fingerprint. Is that correct?

That would work of course, but it leaves a lot of room for easy errors. What I mean is that this solution is a lot less robust than the original implementation. For instance, if sent2vec_model differs between two runs (e.g. a different sent2vec model), that will go unnoticed because the fingerprint stays the same (same ID for the function). That is of course expected behaviour if you think about it, but as I said, it might lead to quick mistakes by users who just want to use unpicklable objects in their code but don’t realise what the consequences may be.

I prefer disabling fingerprinting altogether and manually saving/loading the dataset. That way I always know what is going on.

Yes I see, these easy errors can be confusing.
Let’s focus on allowing users to disable it for now, then.
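
(For anyone finding this thread later: recent versions of datasets do ship a global switch for this. A minimal example, assuming a reasonably recent release:)

from datasets import disable_caching

# turn off fingerprint-based caching globally; transform results are then
# written to temporary files instead of reusable cache files
disable_caching()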

I’ll just add to this discussion that I’ve had errors with the mapping functions themselves not being picklable (I think only when num_proc > 1). I’ve also had the problem that the mapped function’s fingerprint differs across runs even when the function definition is completely unchanged, meaning it fails to reuse pre-cached mappings from run to run unless I explicitly pass the cache file name.

Was this on Windows? Windows is notorious for multiprocessing issues because it uses the spawn start method instead of fork.
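
If it was, the usual fix is to keep the mapped function at module level and guard the script’s entry point, since spawn re-imports the module in each worker. A minimal sketch, reusing the user-defined names from the snippets above:

from datasets import load_dataset

def preprocess(batch):
    # mapped functions must be defined at module level to be picklable
    ...

if __name__ == "__main__":
    # with spawn, anything that launches worker processes must sit under
    # this guard, or every worker re-executes it on import
    datasets = load_dataset("text", data_files={"train": train_f, "validation": valid_f})
    datasets = datasets.map(preprocess, batched=True, num_proc=2)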

Unfortunately you need picklable mapping functions to make multiprocessing work :confused:
Also feel free to open an issue or send me a dm if you are in a situation where the caching fails. I can help you with that :slight_smile:
