How to force caching of previously tokenized data? (run_clm.py)

When using run_clm.py for training/finetuning, caching of the tokenized data is inconsistent. Sometimes running it a second or third time (with the exact same training data) will use the cache as expected; other times it will redo the tokenizing.

With small finetuning datasets this doesn’t matter, but now that I’m using something a little heavier, the tokenizing step has become painfully obvious: it can take 30 minutes or more, even for a gentle finetune that only needs a couple of minutes of training.

There are several directories and files under ‘~/.cache/huggingface/datasets/csv/’, with recent dates that correspond to training runs. I’m not sure whether these contain the tokenized data.

Anyway, is there some way to force the cached version to be used and/or determine why it’s choosing to ignore work already done? Thanks!


Edit: I had a look at the (copious!) debug output from a previous run, and can see that it is caching data under the directory above. So why isn’t the cached data being reused?

Dataset csv downloaded and prepared to /home/ai/.cache/huggingface/datasets/csv/default-f9e1c7e8ec8cffb1/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d. Subsequent calls will reuse this data.

(Current run is still re-tokenizing the same data. :frowning: It’s been 45 minutes so far!)

Hi! Which version of datasets are you using? Some improvements have been made to caching in 2.15.

Anyway, if you still have this issue, it means the tokenizer used to tokenize your data isn’t hashing to the same value across runs. The cache only reloads previously computed data if the hash of the pickle dump of your tokenizer is the same across runs; see the docs on The cache for more info.

You can force the use of a manual cache id (aka dataset fingerprint) by passing new_fingerprint to map, but make sure you change the fingerprint every time you change your map function, or the previous results will always be reloaded:

ds = ds.map(my_func, new_fingerprint="my_custom_fingerprint_that_identifies_the_processed_dataset")
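
This isn’t what run_clm.py literally does internally, but here is a minimal sketch of the same idea, assuming a CSV with a “text” column and a gpt2 tokenizer (both hypothetical stand-ins for whatever --train_file and --model_name_or_path give you):

from datasets import load_dataset
from transformers import AutoTokenizer

# Hypothetical stand-ins for what run_clm.py sets up from its command-line arguments
ds = load_dataset("csv", data_files="train.csv")["train"]
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize_function(examples):
    return tokenizer(examples["text"])

# Bump the fingerprint string (e.g. to "...-v2") whenever tokenize_function or the
# tokenizer changes, otherwise the stale cached result will keep being reloaded.
ds = ds.map(tokenize_function, batched=True, new_fingerprint="tokenized-train-v1")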

Hello! Thank you for your reply.

My current datasets version is 2.14.5, so I guess I can try upgrading that.

To be clear, I am not a Python programmer and am not making library calls with a deep understanding of how they work; I am just executing run_clm.py with local training data.

Is there a higher-level way to determine why run_clm.py is re-tokenizing the exact same data (same file, timestamp, and size) passed with ‘--train_file train.csv’?

I did note that ‘huggingface-cli scan-cache’ does not include any datasets in its list (only cached remote models). Should local data tokenized during a previous run show up here?


Edit: I upgraded datasets, transformers, and tokenizers.

A subsequent run is redoing the tokenizing again. :frowning: The input data has not changed: literally ^C, pip upgrade, re-run. To be fair, I guess changes to the internals of one or more of those packages could invalidate previously cached data.

Unfortunately we don’t have tools to automatically debug caching issues; this is currently done in Python by playing with the datasets hashing mechanism.
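
For example (just a sketch, with "gpt2" as a placeholder for whatever model run_clm.py loads), you can check whether the tokenizer hashes to the same value across runs:

from datasets.fingerprint import Hasher
from transformers import AutoTokenizer

# Run this in two separate Python sessions; if the printed hash differs between them,
# map() will miss the cache and the data will be re-tokenized.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(Hasher.hash(tokenizer))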

The huggingface-cli doesn’t support reading the HF datasets cache yet, unfortunately. It was developed after the datasets lib, and we haven’t migrated datasets to the new HF caching yet.