Creating HuggingFace Dataset from PyArrow table is slow

I tried to convert a PyArrow table to a Hugging Face dataset with the following code:

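import time

import deltalake
from datasets import Dataset as HFDataset  # assumed alias for datasets.Dataset

# Load the Delta Lake table into an in-memory PyArrow table
arrow_tbl = deltalake.DeltaTable(
    table_uri=table_uri
).to_pyarrow_table()

print("Converting to HF ...")
start_time = time.time()
# Wrap the PyArrow table in a Hugging Face Dataset
hf_ds = HFDataset(arrow_tbl)

print(f"Took {time.time() - start_time} seconds")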

The Delta Lake table contains 1 image binary type column and 1 string type column.
The binary column averages around 500 KB per row; the string column is much smaller in comparison, so I don’t think it matters.
The number of rows is 5000.

It took around 14 seconds to convert the PyArrow table to a Hugging Face dataset, which doesn’t make sense to me because I think the underlying PyArrow data can be zero-copied. Did I miss anything?

Hi! The Arrow table is hashed to create a unique Dataset fingerprint (typically a hash string) used for caching 🙂
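
As a rough sanity check on why the hashing step can dominate, the figures from the question can be plugged in. This is only back-of-the-envelope arithmetic: it assumes "500 KB" means 500,000 bytes and that the whole 14 seconds went to hashing, neither of which is confirmed in the thread.

```python
rows = 5000
avg_image_bytes = 500_000  # assumption: "500 KB" taken as 500,000 bytes
elapsed_seconds = 14       # assumption: all of it spent hashing

# Total payload the hash has to walk through
total_bytes = rows * avg_image_bytes
# Throughput implied by the reported timing
throughput_mb_s = total_bytes / elapsed_seconds / 1e6

print(f"Data hashed: {total_bytes / 1e9:.1f} GB")
print(f"Implied throughput: {throughput_mb_s:.0f} MB/s")
```

So the table may be wrapped without a copy, but computing a fingerprint from its contents still means reading every byte once.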

But you can provide the fingerprint yourself to make it faster:

ds = Dataset(arrow_tbl, fingerprint=fingerprint)
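
To make that concrete, here is a minimal sketch of one way to build such a fingerprint cheaply, from table metadata instead of the image bytes. The helper names `cheap_fingerprint` and `delta_to_hf_dataset` are hypothetical, and using the Delta table URI, version, and row count as the key is an assumption about what identifies the data in your setup.

```python
import hashlib


def cheap_fingerprint(table_uri: str, version: int, num_rows: int) -> str:
    """Hypothetical helper: derive a short, deterministic fingerprint from
    metadata only, so the cost does not depend on the size of the data."""
    key = f"{table_uri}|{version}|{num_rows}".encode("utf-8")
    return hashlib.sha256(key).hexdigest()[:16]


def delta_to_hf_dataset(table_uri: str):
    """Hypothetical helper: load a Delta table and wrap it in a Dataset
    without hashing the table contents."""
    # Imported lazily so cheap_fingerprint() works without these packages.
    import deltalake
    from datasets import Dataset

    delta_tbl = deltalake.DeltaTable(table_uri=table_uri)
    arrow_tbl = delta_tbl.to_pyarrow_table()
    fingerprint = cheap_fingerprint(
        table_uri, delta_tbl.version(), arrow_tbl.num_rows
    )
    # Supplying fingerprint= means Dataset does not compute one itself.
    return Dataset(arrow_tbl, fingerprint=fingerprint)
```

Since the fingerprint is used for caching, whatever you pass should change whenever the underlying data changes; a random value such as `uuid.uuid4().hex` would also work if you do not need a stable one.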