The datasets.map function does not load cached dataset

I am using the run_mlm.py script provided in the transformers repository to pretrain BERT, with datasets version 1.8.0. Since the Wikipedia dataset I use is large, I would like the preprocessing to run only once and be reused later. However, I find it always re-computes instead of loading from disk, and I don't think I changed any parameters of the map function. I noticed the description of the new_fingerprint parameter, which might be relevant. The original run_mlm.py code does not specify it; should I give it a value?


Hi! new_fingerprint is computed automatically by taking into account:

  • the previous dataset fingerprint
  • a hash of your map function
  • a hash of the parameters passed to map

So as long as you don’t change your code and you keep the same parameters, the fingerprint will stay the same and the dataset will be reloaded from the disk.

Can you make sure you didn't change your function or the parameters passed to map?
Note that preprocessing_num_workers is part of the parameters passed to map and you must make sure it doesn’t change either.
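
One way to sanity-check this is to hash the map function yourself. Below is a minimal sketch using the Hasher helper from datasets.fingerprint; note that Hasher is an internal utility, so its import path may differ across versions:

from datasets.fingerprint import Hasher
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# If this hash is identical across runs (and across scripts), the cached
# result can be reused; if it changes, map will recompute.
print(Hasher.hash(tokenize_function))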


Thanks for your reply! I am sure that none of the parameters changed. Actually, I tried to figure out the reason and found something interesting: the map function does load the processed dataset if I change nothing. However, if I copy the code into another .py file and run it, the dataset is processed again, even though this second processing should not happen.

The following is the code snippet; I changed nothing but ran it from different files.

if __name__ == '__main__':
    from datasets import load_dataset
    from transformers import AutoTokenizer

    # Load the raw dataset and the tokenizer
    raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    # On a second run this should be reloaded from the cache, not recomputed
    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)

Thanks for reporting, indeed we should make sure the cached results get reloaded even if you move your script. Can you open an issue on GitHub so we can work on a solution?
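
In the meantime, one possible workaround is to persist the processed dataset explicitly with save_to_disk / load_from_disk, which sidesteps the fingerprint mechanism entirely. A minimal sketch (the cache path is a hypothetical name):

import os
from datasets import load_dataset, load_from_disk
from transformers import AutoTokenizer

CACHE_DIR = "tokenized_wikitext"  # hypothetical path, choose your own

if os.path.isdir(CACHE_DIR):
    # Reload the previously processed dataset, regardless of which script runs this
    tokenized_datasets = load_from_disk(CACHE_DIR)
else:
    raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
    tokenized_datasets.save_to_disk(CACHE_DIR)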

Yes, I am glad to do that. :grinning:

Here is the link to the issue: The datasets.map function does not load cached dataset · Issue #2825 · huggingface/datasets · GitHub


@lhoestq

Why is it impossible to keep a cached dataset when changing arguments for text-to-image Stable Diffusion training?

Whenever I change my batch size, the dataset cache is invalidated and another 5 hours are spent mapping my dataset. This is so annoying. I would love to experiment with the batch size without having to wait 5 hours each time I change something in the training args.

Changing the batch size invalidates the cache because the processed dataset is not necessarily the same for different batch sizes.

For example, if your processing tokenizes text with padding to the longest sequence in the batch (padding=True), then your processed dataset is not the same with batch_size 1 or batch_size 8: one has no padding at all, while the other pads to a common length within every group of 8 examples.
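
To make this concrete, here is a small sketch (with a toy two-example dataset) showing that padding=True produces different token lengths depending on the map batch size:

from datasets import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ds = Dataset.from_dict({"text": ["short", "a much longer example sentence"]})

def tok(examples):
    # padding=True pads to the longest sequence in the current batch
    return tokenizer(examples["text"], padding=True, truncation=True)

# batch_size=1: each example is its own batch, so nothing is padded.
print([len(ids) for ids in ds.map(tok, batched=True, batch_size=1)["input_ids"]])
# batch_size=2: the short example is padded to match the longer one.
print([len(ids) for ids in ds.map(tok, batched=True, batch_size=2)["input_ids"]])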

I guess your problem is that you use the same batch_size for data processing as for model training. You should probably set a fixed value for the data processing step, as in the sketch below.
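
For example, something like the following keeps the cache fingerprint independent of the training batch size (the constant name is just for illustration):

from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

# A fixed preprocessing batch size: changing the training batch size
# elsewhere no longer invalidates this cache.
PROCESSING_BATCH_SIZE = 1000  # illustrative fixed value
tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    batch_size=PROCESSING_BATCH_SIZE,
)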