The datasets.map function does not load cached dataset

I am using the run_mlm.py script provided in the transformers repository to pretrain BERT, with datasets version 1.8.0. Since the Wikipedia dataset I use is large, I would like the preprocessing to run only once and be reused later. However, I find it always re-computes instead of loading from disk, and I don't think I changed any parameters of the map function. I noticed the description of the new_fingerprint parameter, which might be relevant. The original run_mlm.py code does not specify it; should I give it a value?


Hi! new_fingerprint is computed automatically by taking into account:

  • the previous dataset fingerprint
  • a hash of your map function
  • a hash of the parameters passed to map

So as long as you don’t change your code and you keep the same parameters, the fingerprint will stay the same and the dataset will be reloaded from the disk.

Can you make sure you didn't change your function or the parameters passed to map?
Note that preprocessing_num_workers is part of the parameters passed to map and you must make sure it doesn’t change either.
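
One way to sanity-check this is to hash the map function yourself. Below is a minimal sketch using the Hasher helper from datasets.fingerprint; note that Hasher is an internal utility, so its import path may differ across versions:

from datasets.fingerprint import Hasher
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# If this hash is identical across runs (and across scripts), the cached
# result can be reused; if it changes, map will recompute.
print(Hasher.hash(tokenize_function))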


Thanks for your reply! I am sure that none of the parameters changed. Actually, I tried to figure out the reason and found something interesting: the map function does load the processed dataset if I change nothing. However, if I copy the code into another .py file and run it, the dataset is processed again, even though this second processing should not happen.

The following is the code snippet; I changed nothing but ran it from different files.

if __name__ == '__main__':
    from datasets import load_dataset
    from transformers import AutoTokenizer

    # Load the raw dataset and the tokenizer
    raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    # On a second run this should be reloaded from the cache, not recomputed
    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

Thanks for reporting, indeed we should make sure the cached results get reloaded even if you move your script. Can you open an issue on GitHub so we can work on a solution?
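
In the meantime, one possible workaround is to persist the processed dataset explicitly with save_to_disk / load_from_disk, which sidesteps the fingerprint mechanism entirely. A minimal sketch (the cache path is a hypothetical name):

import os
from datasets import load_dataset, load_from_disk
from transformers import AutoTokenizer

CACHE_DIR = "tokenized_wikitext"  # hypothetical path, choose your own

if os.path.isdir(CACHE_DIR):
    # Reload the previously processed dataset, regardless of which script runs this
    tokenized_datasets = load_from_disk(CACHE_DIR)
else:
    raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
    tokenized_datasets.save_to_disk(CACHE_DIR)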

Yes, I am glad to do that. :grinning:

Here is the link to the issue: The datasets.map function does not load cached dataset · Issue #2825 · huggingface/datasets · GitHub


@lhoestq

Why is it impossible to keep a cached dataset when changing arguments for text-to-image Stable Diffusion training?

Whenever I change my batch size, the dataset cache is invalidated and another 5 hours are spent mapping my dataset. This is so annoying. I would love to experiment with the batch size without having to wait 5 hours each time I change something in the training args.

Changing the batch size invalidates the cache because the processed dataset is not necessarily the same for different batch sizes.

For example, if your processing tokenizes text with padding to the longest sequence in the batch (padding=True), then your processed dataset is not the same with batch_size 1 or batch_size 8: one has no padding at all, while the other pads to a common length within every group of 8 examples.
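
To make this concrete, here is a small sketch (with a toy two-example dataset) showing that padding=True produces different token lengths depending on the map batch size:

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ds = Dataset.from_dict({"text": ["short", "a much longer example sentence"]})

def tok(examples):
    # padding=True pads to the longest sequence in the current batch
    return tokenizer(examples["text"], padding=True, truncation=True)

# batch_size=1: each example is its own batch, so nothing is padded.
print([len(ids) for ids in ds.map(tok, batched=True, batch_size=1)["input_ids"]])
# batch_size=2: the short example is padded to match the longer one.
print([len(ids) for ids in ds.map(tok, batched=True, batch_size=2)["input_ids"]])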

I guess your problem is that you use the same batch_size for data processing as for model training. You should probably set a fixed value for the data processing step, as in the sketch below.
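
For example, something like the following keeps the cache fingerprint independent of the training batch size (the constant name is just for illustration):

from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# A fixed preprocessing batch size: changing the training batch size
# elsewhere no longer invalidates this cache.
PROCESSING_BATCH_SIZE = 1000  # illustrative fixed value
tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    batch_size=PROCESSING_BATCH_SIZE,
)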