Reading CSV with multi-threading

According to the datasets documentation:

A few interesting features are provided out-of-the-box by the Apache Arrow backend:
    - **multi-threaded** or single-threaded reading
    - automatic decompression of input files (based on the filename extension, such as my_data.csv.gz)
    - fetching column names from the first row in the CSV file
    - column-wise type inference and conversion to one of null, int64, float64, timestamp[s], string or binary data
    - detecting various spellings of null values such as NaN or #N/A

How can I read, for example, train.csv.gz in multi-threaded mode?

I use datasets.load_dataset("csv", data_files="./train.csv.gz"), but htop shows only one CPU core running.

Pinging @lhoestq. Sorry, I can't find the multi-threaded part in the documentation.

Hi! Currently the CSV loader doesn't leverage multithreading or multiprocessing.

This is something we are working on; see [load_dataset] shard and parallelize the process · Issue #2650 · huggingface/datasets · GitHub, which should allow parallelizing the conversion over multiple CSV files.

However, I'm not very familiar with tools that allow multithreading on single files. So if you have any idea/direction that could speed up the conversion of CSV files to Arrow, feel free to share it here :slight_smile:
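
One possible direction, as a rough sketch rather than anything the datasets CSV loader does today: read the file directly with pyarrow.csv.read_csv, whose ReadOptions(use_threads=True) enables multi-threaded parsing, then wrap the resulting Arrow table in a Dataset. The file name below comes from the question above, and wrapping a raw pyarrow Table in the Dataset constructor is an assumption for illustration; the whole table ends up in memory rather than in the datasets cache.

```python
import pyarrow.csv as pv
from datasets import Dataset

# Multi-threaded CSV parsing via Arrow; compression (.gz) is detected
# from the file extension, as the docs quoted above describe.
table = pv.read_csv(
    "./train.csv.gz",  # file name taken from the question above
    read_options=pv.ReadOptions(use_threads=True),
)

# Wrap the in-memory Arrow table in a Dataset (this bypasses the csv
# loader and its cache, so the table must fit in RAM).
ds = Dataset(table)
print(ds)
```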

Thank you for your answer; that issue is exactly what I was wondering about.

I have another question. For my custom dataset here, I'm using lm_dataformat to read .zstd files. I noticed that after downloading and reading the data the first time (for training the tokenizer), it creates 3 files:

  • xxx-train.arrow
  • xxx-validation.arrow (corresponding to my split names in the loading script)
  • dataset_info.json

I have two questions:

  • 1. Can I create a new dataset from these arrow and json files, so that next time I load it I don't have to decompress everything again?
  • 2. I'm running run_clm_flax.py on a TPU-VM and passing my dataset name to this script. I set $HF_HOME to ~/hfcache, a 270 GB tmpfs that I mount from the TPU-VM's RAM, but I still ran out of memory. What would you suggest to overcome this problem? My dataset is ~25 GB compressed, and the two arrow files are around ~60 GB. Can I delete xxx-train.arrow and xxx-validation.arrow once the tmpxxx and cachexxx files are created?

Once you have your processed data, you can save it with save_to_disk to the directory of your choice. When this is done, you can completely delete your cache and reload your processed dataset with load_from_disk (see the documentation).
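
A minimal sketch of that workflow; the directory name and the toy dataset below are placeholders for illustration:

```python
from datasets import Dataset, load_from_disk

# a tiny in-memory dataset standing in for the processed one
ds = Dataset.from_dict({"text": ["hello", "world"]})

# persist the processed data to a directory of your choice (placeholder path)
ds.save_to_disk("processed_dataset")

# later, even after clearing the HF cache, reload it without reprocessing
reloaded = load_from_disk("processed_dataset")
print(reloaded)
```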

Also note that if you really need to, you can load any arrow file with Dataset.from_file("path/to/any/arrow/file")
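
For instance, a small self-contained sketch; the file written here with pyarrow stands in for an existing cache file such as xxx-train.arrow, assuming the cache files are in the Arrow streaming format:

```python
import pyarrow as pa
from datasets import Dataset

# write a tiny Arrow file in the streaming format (standing in for a
# cached xxx-train.arrow file; the file name is made up)
table = pa.table({"text": ["hello", "world"]})
with pa.OSFile("example.arrow", "wb") as sink:
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)

# load that single arrow file directly as a Dataset
ds = Dataset.from_file("example.arrow")
print(ds)
```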

Thank you @lhoestq

The arrow file reading worked beautifully.
In my case, I resized the tmpfs to 300 GB and it finally fits my dataset ;__;

I will modify the script so that next time it won't load the raw dataset again, but will load the preprocessed one instead, using your suggested load_from_disk.
