Thanks for your reply! I am sure that none of the parameters changed. While trying to track down the cause, I found something interesting: `map` does load the processed datasets from the cache if I change nothing. However, if I copy the same code into another .py file and run it, the datasets are processed again, even though I would expect the cache to be reused.
The following is the code snippet; I changed nothing, only ran it from different files.
```python
if __name__ == '__main__':
    from datasets import load_dataset
    from transformers import AutoTokenizer

    raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def tokenize_function(examples):
        return tokenizer(examples["text"], padding="max_length", truncation=True)

    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
```
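To illustrate my understanding of what might be going on: I believe `map` decides whether to reuse the cache by computing a fingerprint of the transform function together with the dataset and the other arguments, so if anything about how the function is hashed differs between the two files, the cache would be invalidated. The `fingerprint` helper below is a rough stdlib-only sketch of that idea, not the actual `datasets` implementation:

```python
import hashlib

def fingerprint(fn):
    # Hypothetical sketch: derive a stable hash from the function's
    # bytecode and constants, roughly analogous to how a library could
    # fingerprint a transform to decide whether a cached result applies.
    code = fn.__code__
    payload = code.co_code + repr(code.co_consts).encode()
    return hashlib.sha256(payload).hexdigest()[:16]
```

Under this model, two textually identical functions hash the same, but any change to the function body (or to anything else folded into the hash) produces a different fingerprint and forces reprocessing. If the real fingerprint also depends on something file-specific, that would match the behavior I am seeing.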