Hi, I tokenize my data as follows, but every time I run the script, the map step is recomputed from scratch even though a cached version already exists in the respective folder. Can anyone help me avoid this redundant step?
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(script_args.model_name, cache_dir="hf_cache_dir", local_files_only=True)

def tokenize_function(example):
    return tokenizer(example["text"], truncation=True)

# Expected to reuse the cached result, but it re-tokenizes on every run.
tokenized_datasets = dataset.map(tokenize_function, batched=True, load_from_cache_file=True)
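One workaround I was considering is pinning the cache file explicitly, roughly as sketched below (assuming dataset is a DatasetDict with a "train" split and that cache_file_names is the right argument for that case; the file path is just a placeholder), but I'd still like to understand why the automatic cache is not picked up:

# Untested sketch of the workaround: pin the cache file explicitly so map()
# reuses it instead of relying on the automatic fingerprint.
# Assumes `dataset` is a DatasetDict with a "train" split; the path is a placeholder.
tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,
    load_from_cache_file=True,
    cache_file_names={"train": "hf_cache_dir/tokenized_train.arrow"},
)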