Hi!
a) I don’t want to save anything to `~/.cache/huggingface/datasets/`, as I am saving the final result at a separate location for further use. I tried using `load_dataset(..., cache_dir=None)` and setting `datasets.disable_caching()`, but neither seems to work. From some other threads, I understood that caching can be disabled in `dataset.map()` and `dataset.filter()`, but not in `load_dataset()`. How do I disable all types of caching?
Indeed, currently `disable_caching()` uses a temp directory when saving intermediate `map` results, but `load_dataset` still writes the original dataset to `~/.cache/huggingface/datasets/`.
b) I plan to train a GPT-like transformer model on this tokenised data using the HF ecosystem. I want to conserve disk space, but at the same time not make loading extremely slow downstream. Which is better in Step 3 above: the `arrow` or `parquet` format?
It depends on the dataset size and your training setup, but usually Arrow is fine. For bigger datasets you may use Parquet instead and load it with `streaming=True`.