Reading CSV with multi-threading

According to the datasets documentation:

A few interesting features are provided out-of-the-box by the Apache Arrow backend:
    - **multi-threaded** or single-threaded reading
    - automatic decompression of input files (based on the filename extension, such as my_data.csv.gz)
    - fetching column names from the first row in the CSV file
    - column-wise type inference and conversion to one of null, int64, float64, timestamp[s], string or binary data
    - detecting various spellings of null values such as NaN or #N/A

How can I read, for example, train.csv.gz in multi-threaded mode?

I use datasets.load_dataset("csv", data_files="./train.csv.gz"), but htop shows only one CPU core running.

Pinging @lhoestq. Sorry, I can't find the multi-threaded part in the documentation.

Hi! Currently the CSV loader doesn't leverage multithreading or multiprocessing.

This is something we are working on; see [load_dataset] shard and parallelize the process · Issue #2650 · huggingface/datasets · GitHub, which should allow parallelizing the conversion over multiple CSV files.

However, I'm not very familiar with tools that allow multithreading on single files. So if you have any idea/direction that could speed up the conversion of CSV files to Arrow, feel free to share it here :slight_smile:
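
One possible direction, as a rough sketch rather than anything the datasets CSV loader does today: read the file directly with pyarrow.csv.read_csv, whose ReadOptions(use_threads=True) enables multi-threaded parsing, then wrap the resulting Arrow table in a Dataset. The file name below comes from the question above, and wrapping a raw pyarrow Table in the Dataset constructor is an assumption for illustration; the whole table ends up in memory rather than in the datasets cache.

```python
import pyarrow.csv as pv
from datasets import Dataset

# Multi-threaded CSV parsing via Arrow; compression (.gz) is detected
# from the file extension, as the docs quoted above describe.
table = pv.read_csv(
    "./train.csv.gz",  # file name taken from the question above
    read_options=pv.ReadOptions(use_threads=True),
)

# Wrap the in-memory Arrow table in a Dataset (this bypasses the csv
# loader and its cache, so the table must fit in RAM).
ds = Dataset(table)
print(ds)
```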

Thank you for your answer; that issue is exactly what I was wondering about.

I have another question. For my custom dataset here, I'm using lm_dataformat to read .zstd files. I noticed that after downloading and reading the data the first time (for training the tokenizer), it creates 3 files:

  • xxx-train.arrow
  • xxx-validation.arrow (corresponding to my split names in the loading script)
  • dataset_info.json

I have two questions:

  • 1. Can I create a new dataset from these arrow and json files, so that next time I load it I don't have to decompress everything again?
  • 2. I'm running run_clm_flax.py on a TPU-VM and passing my dataset name to this script. I set $HF_HOME to ~/hfcache, a 270 GB tmpfs that I mount from the TPU-VM's RAM, but I still ran out of memory. What would you suggest to overcome this problem? My dataset is ~25 GB compressed, and the two arrow files are around ~60 GB. Can I delete xxx-train.arrow and xxx-validation.arrow once the tmpxxx and cachexxx files are created?

Once you have your processed data, you can save it with save_to_disk to the directory of your choice. When this is done, you can completely delete your cache and reload your processed dataset with load_from_disk (see the documentation).
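
A minimal sketch of that workflow; the directory name and the toy dataset below are placeholders for illustration:

```python
from datasets import Dataset, load_from_disk

# a tiny in-memory dataset standing in for the processed one
ds = Dataset.from_dict({"text": ["hello", "world"]})

# persist the processed data to a directory of your choice (placeholder path)
ds.save_to_disk("processed_dataset")

# later, even after clearing the HF cache, reload it without reprocessing
reloaded = load_from_disk("processed_dataset")
print(reloaded)
```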

Also note that if you really need to, you can load any arrow file with Dataset.from_file("path/to/any/arrow/file")
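
For instance, a small self-contained sketch; the file written here with pyarrow stands in for an existing cache file such as xxx-train.arrow, assuming the cache files are in the Arrow streaming format:

```python
import pyarrow as pa
from datasets import Dataset

# write a tiny Arrow file in the streaming format (standing in for a
# cached xxx-train.arrow file; the file name is made up)
table = pa.table({"text": ["hello", "world"]})
with pa.OSFile("example.arrow", "wb") as sink:
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)

# load that single arrow file directly as a Dataset
ds = Dataset.from_file("example.arrow")
print(ds)
```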

Thank you @lhoestq

The arrow file reading worked beautifully.
In my case, I resized the tmpfs to 300 GB and it finally fits my dataset ;__;

I will modify the script so that next time it won't load the raw dataset again, but will load the preprocessed one instead, using your suggested load_from_disk.
