The datasets.map function does not load cached dataset

I am using the run_mlm.py script provided in the transformers repository to pretrain BERT. The datasets library is version 1.8.0. Since the Wikipedia dataset I use is large, I hope the preprocessing can be done once and reused later. However, I find that it is always re-computed instead of loaded from disk. I don’t think I changed any parameters passed to the map function. I noticed that the description of the new_fingerprint parameter suggests it might have an influence. The original code of run_mlm.py does not specify it; should I give it a value?

Hi! new_fingerprint is computed automatically by taking into account:

  • the previous dataset fingerprint
  • a hash of your map function
  • a hash of the parameters passed to map

So as long as you don’t change your code and you keep the same parameters, the fingerprint will stay the same and the dataset will be reloaded from the disk.

Can you make sure you didn’t change your function or the parameters passed to map?
Note that preprocessing_num_workers is part of the parameters passed to map, so you need to make sure it doesn’t change either.
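
If you want to double-check this, one thing you can try (just a sketch, using the Hasher that datasets relies on internally, so it is not public API and may change between versions) is to hash your tokenize function yourself and compare the value across the two scripts:

from datasets.fingerprint import Hasher  # internal helper used for fingerprinting
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


# If this hash differs between two runs (or between your two scripts),
# the fingerprint changes and map() re-computes instead of reusing the cache.
print(Hasher.hash(tokenize_function))

If the hash is identical in both scripts but the dataset is still re-processed, then the difference comes from somewhere else (for example one of the parameters passed to map).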

Thanks for your reply! I am sure that none of the parameters changed. Actually, I tried to figure out the reason and found something interesting. The map function does load the processed datasets if I change nothing. However, if I copy the code to another .py file and run it, the datasets are processed again. But I suppose this second processing should not happen.

The following is the code snippet; I changed nothing but ran it from different files.

if __name__ == '__main__':
    from datasets import load_dataset
    from transformers import AutoTokenizer
    raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)


    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

Thanks for reporting, indeed we should make sure the cached results get reloaded even if you move your script. Can you open an issue on GitHub so we can work on a solution?
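
In the meantime, a possible workaround (just a sketch; the directory name is arbitrary, pick whatever you like) is to save the processed dataset once with save_to_disk and reload it explicitly in later runs, which does not depend on the fingerprint at all:

import os
from datasets import load_dataset, load_from_disk
from transformers import AutoTokenizer

PROCESSED_DIR = "wikitext-2-tokenized"  # hypothetical output directory

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")


def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)


if os.path.isdir(PROCESSED_DIR):
    # Reload the previously processed dataset, independent of fingerprints
    tokenized_datasets = load_from_disk(PROCESSED_DIR)
else:
    raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
    tokenized_datasets.save_to_disk(PROCESSED_DIR)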

Yes, I am glad to do that. :grinning:

The link to the issue is The datasets.map function does not load cached dataset · Issue #2825 · huggingface/datasets · GitHub