How to cache tokenization for the data

Hi I tokenize my data as follows but every time I try to run it, the code does the mapping scratch although there is a cached one in the respective folder. Can anyone help to avoid this redundant process?

tokenizer=AutoTokenizer.from_pretrained(script_args.model_name, cache_dir="hf_cache_dir", local_files_only=True)
def tokenize_function(example):
    return tokenizer(example["text"], truncation=True)

tokenized_datasets =, batched=True, load_from_cache_file=True)

Hi! This is a known issue: AutoTokenizer hash value got change after · Issue #3638 · huggingface/datasets · GitHub. It requires rewriting large parts of the tokenizers lib, so we haven’t fixed it yet. In the meantime, you can bypass it by setting the tokenizer’s state with a dummy call such as _ = tokenizer("Dummy text", truncation=True) before the map.

Thanks for your suggestion but this didn’t work out for me.