Thank you for your answer, that issue is also something I was wondering about.
I have another question. For my custom dataset I'm using lm_dataformat to read .zstd files. I notice that after the data is downloaded and read the first time (for training the tokenizer), it creates 3 files:
- `xxx-train.arrow`
- `xxx-validation.arrow` (corresponding to the split names in my loading script)
- `dataset_info.json`
I have two questions:
- 1. Can I create a new dataset from these arrow and json files, so that the next time I load it I don't have to decompress everything again? (See the sketch after this list for what I have in mind.)
- 2. I'm running `run_clm_flax.py` on a TPU-VM and passing my dataset name to the script. I set `$HF_HOME` to `~/hfcache`, which is a 270 GB tmpfs that I mount from TPU-VM RAM, but I still run out of memory. What is your suggestion for overcoming this problem? My dataset is ~25 GB compressed, and the two arrow files are around ~60 GB. Can I delete the `xxx-train.arrow` and `xxx-validation.arrow` files once the `tmpxxx` and `cachexxx` files are created?
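
To make question 1 concrete, here is roughly what I have in mind. This is just a sketch: I'm not sure `Dataset.from_file` is the intended way to reuse the cached arrow files, and the cache path below is an illustrative placeholder, not my real one.

```python
import os
from datasets import Dataset, DatasetDict

# Illustrative placeholder for wherever the cached arrow files actually live
cache_dir = os.path.expanduser("~/hfcache/datasets/xxx")

# Memory-map the existing arrow files instead of re-decompressing the .zstd archives
train = Dataset.from_file(os.path.join(cache_dir, "xxx-train.arrow"))
validation = Dataset.from_file(os.path.join(cache_dir, "xxx-validation.arrow"))

dataset = DatasetDict({"train": train, "validation": validation})
print(dataset)
```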
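
For context on question 2, I export `HF_HOME` in the shell before launching `run_clm_flax.py`; in Python terms it is roughly equivalent to the following (the dataset name is a placeholder for my actual loading script):

```python
import os

# Set before importing `datasets` so the default cache location picks it up
os.environ["HF_HOME"] = os.path.expanduser("~/hfcache")  # 270 GB tmpfs backed by TPU-VM RAM

from datasets import load_dataset

# "my_zstd_dataset" stands in for my real dataset / loading script name
raw_datasets = load_dataset("my_zstd_dataset")
```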