Creating a batch from a list of ids in the dataset is very slow

I tried the suggestion from the thread "Local dataset loading performance: HF's arrow vs torch.load" (post #3 by mztelus) to call .with_format('torch'), but that did not help either. Most of the time is now spent in PyArrow's ChunkedArray.to_numpy() method (see pyarrow.ChunkedArray in the Apache Arrow v18.0.0 documentation).
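
For anyone who wants to check where the time goes on their own data, here is a rough sketch of how a hotspot like this can be located with the standard-library profiler. The helper name `profile_call` and the names `ds` and `ids` in the usage comment are hypothetical, not taken from my actual code:

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, top=10, **kwargs):
    """Run fn(*args, **kwargs) under cProfile.

    Returns (result, report), where report is a text table of the
    `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)

    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


# Hypothetical usage, assuming `ds` is a datasets.Dataset and
# `ids` is a list of row indices:
#   batch, report = profile_call(lambda: ds.with_format("torch")[ids])
#   print(report)
```

If the conversion is the bottleneck, a to_numpy entry should show up near the top of the report.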
