Thank you for your answer, that issue is also something I was wondering about.
I have another question. For my custom dataset I'm using lm_dataformat to read .zstd files. I notice that after the data is downloaded and read the first time (for training the tokenizer), it creates 3 files:
- `xxx-train.arrow`
- `xxx-validation.arrow` (corresponding to the split names in my loading script)
- `dataset_info.json`
I have two questions:
- 1. Can I create a new dataset from these arrow and json files, so that the next time I load it I don't have to decompress everything again? (See the sketch after this list for what I have in mind.)
- 2. I'm running `run_clm_flax.py` on a TPU-VM and passing my dataset name to the script. I set `$HF_HOME` to `~/hfcache`, which is a 270 GB tmpfs that I mount from TPU-VM RAM, but I still run out of memory. What is your suggestion for overcoming this problem? My dataset is ~25 GB compressed, and the two arrow files are around ~60 GB. Can I delete the `xxx-train.arrow` and `xxx-validation.arrow` files once the `tmpxxx` and `cachexxx` files are created?
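
To make question 1 concrete, here is roughly what I have in mind. This is just a sketch: I'm not sure `Dataset.from_file` is the intended way to reuse the cached arrow files, and the cache path below is an illustrative placeholder, not my real one.

```python
import os
from datasets import Dataset, DatasetDict

# Illustrative placeholder for wherever the cached arrow files actually live
cache_dir = os.path.expanduser("~/hfcache/datasets/xxx")

# Memory-map the existing arrow files instead of re-decompressing the .zstd archives
train = Dataset.from_file(os.path.join(cache_dir, "xxx-train.arrow"))
validation = Dataset.from_file(os.path.join(cache_dir, "xxx-validation.arrow"))

dataset = DatasetDict({"train": train, "validation": validation})
print(dataset)
```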
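
For context on question 2, I export `HF_HOME` in the shell before launching `run_clm_flax.py`; in Python terms it is roughly equivalent to the following (the dataset name is a placeholder for my actual loading script):

```python
import os

# Set before importing `datasets` so the default cache location picks it up
os.environ["HF_HOME"] = os.path.expanduser("~/hfcache")  # 270 GB tmpfs backed by TPU-VM RAM

from datasets import load_dataset

# "my_zstd_dataset" stands in for my real dataset / loading script name
raw_datasets = load_dataset("my_zstd_dataset")
```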