I tried to convert a PyArrow table to a Hugging Face dataset with the following code:
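import time

import deltalake
from datasets import Dataset as HFDataset  # assumed: HFDataset is datasets.Dataset

# table_uri is the location of the Delta Lake table
arrow_tbl = deltalake.DeltaTable(
    table_uri=table_uri
).to_pyarrow_table()
print("Converting to HF ...")
start_time = time.time()
hf_ds = HFDataset(arrow_tbl)
print(f"Took {time.time() - start_time} seconds")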
The Delta Lake table has one binary column holding images and one string column.
The binary column's values average around 500 KB each; the string column is much smaller in comparison, so I don't think it matters.
The table has 5,000 rows.
It took around 14 seconds to convert the PyArrow table to a Hugging Face dataset, which doesn't make sense to me because I would expect the underlying PyArrow data to be used zero-copy. Did I miss anything?
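For a sense of scale, here is a rough back-of-the-envelope calculation using the numbers above. It is only a sketch: it assumes 500 KB means 500,000 bytes and ignores the string column. The variable names are just for illustration.

```python
# Rough estimate of the data volume and the effective throughput implied by
# the timing above.
# Assumptions: 500 KB = 500_000 bytes; the string column is ignored.
num_rows = 5_000
avg_binary_bytes = 500_000
elapsed_seconds = 14

total_bytes = num_rows * avg_binary_bytes
total_gb = total_bytes / 1e9
throughput_mb_per_s = total_bytes / 1e6 / elapsed_seconds

print(f"Approx. binary data: {total_gb:.1f} GB")
print(f"Implied rate if every byte were touched once: {throughput_mb_per_s:.0f} MB/s")
```

So the conversion time looks more like what I would expect if all the image bytes were being read or copied once, rather than a constant-time wrap of the existing buffers.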