Hi, I tokenize my data as follows, but every time I run the script, the map step is recomputed from scratch even though a cached version already exists in the respective folder. Can anyone help me avoid this redundant step?
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(script_args.model_name, cache_dir="hf_cache_dir", local_files_only=True)

def tokenize_function(example):
    return tokenizer(example["text"], truncation=True)

# Expected to reuse the cached result, but it re-tokenizes on every run.
tokenized_datasets = dataset.map(tokenize_function, batched=True, load_from_cache_file=True)
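One workaround I was considering is pinning the cache file explicitly, roughly as sketched below (assuming dataset is a DatasetDict with a "train" split and that cache_file_names is the right argument for that case; the file path is just a placeholder), but I'd still like to understand why the automatic cache is not picked up:

# Untested sketch of the workaround: pin the cache file explicitly so map()
# reuses it instead of relying on the automatic fingerprint.
# Assumes `dataset` is a DatasetDict with a "train" split; the path is a placeholder.
tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,
    load_from_cache_file=True,
    cache_file_names={"train": "hf_cache_dir/tokenized_train.arrow"},
)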