I am doing the following three steps for a large number of iterations (a rough sketch of the loop follows the list):

- Loading a `parquet` file using `load_dataset()`.
- Tokenising it using `dataset.map()` and HuggingFace tokenizers.
- Saving the tokenised dataset on disk in `arrow` format.
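The file paths, tokenizer name, shard count, and column name here are placeholders, not my actual setup:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

def tokenize_fn(batch):
    # assumes the parquet files have a "text" column
    return tokenizer(batch["text"])

num_shards = 100  # placeholder iteration count

for i in range(num_shards):
    ds = load_dataset("parquet", data_files=f"data/shard_{i:05d}.parquet", split="train")
    ds = ds.map(tokenize_fn, batched=True, remove_columns=["text"])
    ds.save_to_disk(f"tokenised/shard_{i:05d}")  # arrow format
```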
I have the following questions:
a) I don’t want to save anything to `~/.cache/huggingface/datasets/`, since I am saving the final result at a separate location for further use. I tried using `load_dataset(..., cache_dir=None)` and setting `datasets.disable_caching()`, but neither seems to work. From some other threads, I understood that caching can be disabled in `dataset.map()` and `dataset.filter()`, but not in `load_dataset()`. How do I disable all types of caching?
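For reference, this is roughly what I tried (the path is a placeholder):

```python
import datasets
from datasets import load_dataset

datasets.disable_caching()  # seems to affect map()/filter() only

ds = load_dataset(
    "parquet",
    data_files="data/shard_00000.parquet",  # placeholder
    cache_dir=None,  # None just falls back to the default cache location
    split="train",
)
# the prepared arrow files still show up under ~/.cache/huggingface/datasets/
```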
b) I plan to train a GPT-like transformer model on this tokenised data using the HF ecosystem. I want to conserve disk space, but at the same time not make loading extremely slow downstream. Which format is better for Step 3 above: `arrow` or `parquet`?
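Concretely, I'm choosing between these two calls for Step 3 (`ds` is the tokenised dataset from `map()` above; paths are placeholders):

```python
# Option 1: arrow — reload later with datasets.load_from_disk(); the files are
# memory-mapped so loading is fast, but they are uncompressed and larger on disk
ds.save_to_disk("tokenised/shard_00000")

# Option 2: parquet — compressed, so smaller on disk; reload with
# load_dataset("parquet", data_files="tokenised/shard_00000.parquet")
ds.to_parquet("tokenised/shard_00000.parquet")
```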