Extremely Slow Loading of Parquet Dataset with datasets

Certain data types, such as lists, are slower to load as pure Python objects than others. If your dataset contains arrays or long lists, it’s faster to load them as NumPy arrays, e.g.:

ds = ds.with_format("numpy")

Btw, you can also fetch multiple examples faster by passing a list of indices to ds[...]:

indices = [...]
examples = ds[indices]

^ This is faster than fetching the examples one at a time in a for loop.
