I tried a suggestion from the thread "Local dataset loading performance: HF's arrow vs torch.load" (post #3 by mztelus), which is to call .with_format('torch'), but that did NOT help either. Now most of the time is spent in PyArrow's ChunkedArray.to_numpy() method (see pyarrow.ChunkedArray in the Apache Arrow v18.0.0 docs).
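
For anyone who wants to check where the time goes on their own data, here is a minimal sketch of one way to measure it with cProfile. The helper name profile_iteration and the toy in-memory dataset are hypothetical and only for illustration; they are not my real setup, and the numbers will differ with your data.

```python
import cProfile
import io
import pstats


def profile_iteration(iterable, max_items=1000, top=10):
    """Iterate over up to `max_items` items under cProfile.

    Returns a text report of the `top` functions sorted by cumulative time.
    `profile_iteration` is a hypothetical helper, not part of any library.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    for i, _ in enumerate(iterable):
        if i + 1 >= max_items:
            break
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    try:
        # Assumption: `datasets` and `torch` are installed.
        import torch  # noqa: F401
        from datasets import Dataset
    except ImportError:
        print("Install `datasets` and `torch` to run the dataset demo.")
    else:
        # Toy stand-in for a real dataset (made-up data).
        ds = Dataset.from_dict({"x": [[float(i)] * 128 for i in range(10_000)]})
        ds = ds.with_format("torch")
        # Look for pyarrow's to_numpy among the top entries of the report.
        print(profile_iteration(ds))
```

Sorting by cumulative time makes it easier to see which call in the formatting path dominates, rather than only the innermost function.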