Creating HuggingFace Dataset from PyArrow table is slow

m4nhb4nhq · December 11, 2024, 6:12am

I tried to convert a PyArrow table to a Hugging Face dataset with the following code:

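import time

import deltalake
# Assumed alias: the post does not show its imports
from datasets import Dataset as HFDataset

# table_uri is the location of the Delta Lake table (defined elsewhere)
arrow_tbl = deltalake.DeltaTable(
    table_uri=table_uri
).to_pyarrow_table()

print("Converting to HF ...")
start_time = time.time()
hf_ds = HFDataset(arrow_tbl)

print(f"Took {time.time() - start_time} seconds")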

The Delta Lake table contains one binary column holding images and one string column.
The binary column averages around 500 KB per row; the string column is much smaller in comparison, so I don't think it matters.
The number of rows is 5000.

It took around 14 seconds to convert the PyArrow table to a Hugging Face dataset. That doesn't make sense to me, because I thought the underlying PyArrow data could be used zero-copy. Did I miss anything?

lhoestq · December 11, 2024, 3:43pm

Hi! The Arrow table is hashed to create a unique Dataset fingerprint (typically a hash string) used for caching, and that hashing is what takes the time.

But you can provide the fingerprint yourself to make it faster:

ds = Dataset(arrow_tbl, fingerprint=fingerprint)
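
As a rough sketch of how the fingerprint could be produced cheaply, one option is to derive it from table metadata instead of the table contents. The helper names `make_fingerprint` and `to_hf_dataset` below are hypothetical, and the sketch assumes the table URI plus a version number is enough to identify the data:

```python
import hashlib


def make_fingerprint(table_uri: str, version: int) -> str:
    """Hypothetical helper: build a short, deterministic fingerprint
    from table metadata rather than from the table contents."""
    key = f"{table_uri}@v{version}".encode("utf-8")
    # Keep the first 16 hex characters of the SHA-256 digest
    return hashlib.sha256(key).hexdigest()[:16]


def to_hf_dataset(arrow_tbl, table_uri: str, version: int):
    """Hypothetical wrapper: build the Dataset with a precomputed
    fingerprint, as suggested in the reply above."""
    # Imported lazily so make_fingerprint works without datasets installed
    from datasets import Dataset

    return Dataset(arrow_tbl, fingerprint=make_fingerprint(table_uri, version))
```

For a Delta Lake table, the version passed in could come from the table's own version number (`DeltaTable.version()` in the deltalake package, if I remember the API correctly). Since the fingerprint is used for caching, it should change whenever the data changes; reusing one fingerprint for different data could make cached results get mixed up.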
