How to cache tokenization for the data

Kason123 · January 12, 2024, 12:42am

Hi, I tokenize my data as follows, but every time I run the code, it redoes the mapping from scratch even though a cached version exists in the respective folder. Can anyone help me avoid this redundant processing?

from transformers import AutoTokenizer

# script_args and dataset come from elsewhere in the script (not shown)
tokenizer = AutoTokenizer.from_pretrained(script_args.model_name, cache_dir="hf_cache_dir", local_files_only=True)

def tokenize_function(example):
    return tokenizer(example["text"], truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True, load_from_cache_file=True)

mariosasko · January 15, 2024, 7:11pm

Hi! This is a known issue: "AutoTokenizer hash value got change after datasets.map" (Issue #3638 in huggingface/datasets on GitHub). Fixing it requires rewriting large parts of the tokenizers library, so we haven’t done it yet. In the meantime, you can work around it by setting the tokenizer’s state with a dummy call such as _ = tokenizer("Dummy text", truncation=True) before the map.
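
To illustrate why a dummy call might help, here is a toy sketch that is not part of the original reply. It assumes the cache key is derived from the state of the objects the mapped function uses. ToyTokenizer and fingerprint are hypothetical stand-ins; the real hashing in datasets and the real tokenizer internals are more involved.

```python
import hashlib
import pickle


class ToyTokenizer:
    """Hypothetical stand-in for a tokenizer whose internal state changes on use."""

    def __init__(self):
        self.truncation = None  # unset until the first call configures it

    def __call__(self, text, truncation=False):
        # Side effect: calling the tokenizer records the truncation setting.
        self.truncation = truncation
        return text.split()


def fingerprint(obj):
    """Hash the object's state, loosely mimicking a cache fingerprint."""
    return hashlib.sha256(pickle.dumps(obj.__dict__)).hexdigest()


# Without a dummy call: the state seen before the first use differs from
# the state afterwards, so the key computed for the map changes.
fresh = ToyTokenizer()
before = fingerprint(fresh)
fresh("some text", truncation=True)
after = fingerprint(fresh)

# With a dummy call first: the state is already settled, so the key
# stays the same before and after the real work.
warmed = ToyTokenizer()
warmed("Dummy text", truncation=True)
key_before_map = fingerprint(warmed)
warmed("some text", truncation=True)
key_after_map = fingerprint(warmed)

print(before == after)                  # keys differ without the warm-up
print(key_before_map == key_after_map)  # keys match with the warm-up
```

In this toy model the dummy call has to use the same arguments as the real calls (here truncation=True); otherwise the recorded state, and therefore the key, would still change.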

Kason123 · January 16, 2024, 11:59pm

Thanks for your suggestion, but this didn’t work for me.
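
If the dummy call does not help, one possible fallback (not suggested in the thread itself) is to sidestep the fingerprint-based cache and persist the tokenized data explicitly with save_to_disk, reloading it with load_from_disk on later runs. The sketch below is a minimal version of that idea; the helper name load_or_tokenize and the save_path argument are hypothetical.

```python
import os


def load_or_tokenize(dataset, tokenize_function, save_path):
    """Reuse a tokenized copy saved at save_path, or build and save one."""
    if os.path.isdir(save_path):
        # Imported lazily so `datasets` is only needed when reloading.
        from datasets import load_from_disk

        return load_from_disk(save_path)

    # First run: tokenize as usual, then write the result to disk.
    tokenized = dataset.map(tokenize_function, batched=True)
    tokenized.save_to_disk(save_path)
    return tokenized
```

It could be called as tokenized_datasets = load_or_tokenize(dataset, tokenize_function, "tokenized_cache"), where the directory name is arbitrary. Note that this helper does not detect changes: if the tokenizer, the tokenization arguments, or the raw data change, the saved directory has to be deleted by hand so the data is tokenized again.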
