Dataset can't cache model's outputs

Hi,

I'm trying to cache the outputs of a teacher model (for knowledge distillation) using the map function of the Datasets library, but every time I run my code all the sequences are recomputed. I tested a BERT model like this and got a different hash on every single run, so any idea how to deal with this?

from transformers import AutoTokenizer, BertModel
from datasets.fingerprint import Hasher
import torch

model = BertModel.from_pretrained("bert-base-uncased").eval()
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
token = ['hello']

def abcd():
    # Run the teacher model without gradient tracking and return the last hidden state
    with torch.no_grad():
        out = model(**tok(token, return_tensors='pt'))[0]
    return out

# This prints a different hash on every run, so the map cache is never reused
print(Hasher.hash(abcd))
print(abcd())

It looks like the model doesn’t have a deterministic hash: every time I run Hasher.hash(model) I get a different hash.
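
For reference, a minimal check (reusing the model object from the snippet above; the printed value changes between Python sessions):

from datasets.fingerprint import Hasher

# Hashing the model object itself gives a different value on each fresh run
print(Hasher.hash(model))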

If the hash is not deterministic, then the fingerprint of the processed dataset is not deterministic either, so the cache can't reload previously computed results. See more info in the docs.

However, you can specify the processed dataset's fingerprint yourself by passing new_fingerprint= to map:

ds = ds.map(my_func, new_fingerprint="hash_that_identifies_my_func_and_my_dataset")

Be careful with this, though: if you change my_func or your input dataset, you have to change new_fingerprint to a new value, otherwise the cache will reload the result of previous computations.
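
For example, a minimal sketch of one way to manage this (the FUNC_VERSION constant and the fingerprint string are just illustrative names, not part of the datasets API):

# Bump FUNC_VERSION whenever my_func or the input dataset changes,
# so stale cache entries are not reloaded by mistake
FUNC_VERSION = "v1"
ds = ds.map(my_func, new_fingerprint=f"bert_teacher_outputs_{FUNC_VERSION}")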

I opened an issue to improve the hashing of pytorch tensors: [Caching] Deterministic hashing of torch tensors · Issue #5170 · huggingface/datasets · GitHub


Thanks! It works.