Dataset can't cache model's outputs

Hi,

I'm trying to cache the outputs of a teacher model (for knowledge distillation) using the map function of the Datasets library, but every time I run my code all the sequences are recomputed. I tested a BERT model like this and got a different hash on every single run, so any idea how to deal with this?

from transformers import AutoTokenizer, BertModel
from datasets.fingerprint import Hasher
import torch

model = BertModel.from_pretrained("bert-base-uncased").eval()
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
token = ['hello']

def abcd():
    # Run the teacher model without gradient tracking and return the last hidden state
    with torch.no_grad():
        out = model(**tok(token, return_tensors='pt'))[0]
    return out

# This prints a different hash on every run, so the map cache is never reused
print(Hasher.hash(abcd))
print(abcd())

It looks like the model doesn’t have a deterministic hash: every time I run Hasher.hash(model) I get a different hash.
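
For reference, a minimal check (reusing the model object from the snippet above; the printed value changes between Python sessions):

from datasets.fingerprint import Hasher

# Hashing the model object itself gives a different value on each fresh run
print(Hasher.hash(model))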

If the hash is not deterministic, then the fingerprint of the processed dataset is not deterministic either, so the cache can't reload previously computed results. See more info in the docs.

However, you can specify the processed dataset's fingerprint yourself by passing new_fingerprint= to map:

ds = ds.map(my_func, new_fingerprint="hash_that_identifies_my_func_and_my_dataset")

Be careful with this, though: if you change my_func or your input dataset, you have to change new_fingerprint to a new value, otherwise the cache will reload the result of previous computations.
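
For example, a minimal sketch of one way to manage this (the FUNC_VERSION constant and the fingerprint string are just illustrative names, not part of the datasets API):

# Bump FUNC_VERSION whenever my_func or the input dataset changes,
# so stale cache entries are not reloaded by mistake
FUNC_VERSION = "v1"
ds = ds.map(my_func, new_fingerprint=f"bert_teacher_outputs_{FUNC_VERSION}")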

I opened an issue to improve the hashing of pytorch tensors: [Caching] Deterministic hashing of torch tensors · Issue #5170 · huggingface/datasets · GitHub


Thanks! It works.