Creating a batch from a list of ids in the dataset is very slow

dhruvgrammarly · November 24, 2024, 4:14am

I tried a suggestion from the thread "Local dataset loading performance: HF's arrow vs torch.load" (post #3 by mztelus) and called .with_format('torch'), but that did NOT help either. Most of the time is now spent in PyArrow's ChunkedArray.to_numpy() method (see the pyarrow.ChunkedArray documentation, Apache Arrow v18.0.0).
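For anyone who wants to reproduce this, here is a minimal timing sketch. It is not my real code: the toy dataset, the column name `x`, the sizes, and the helper names `time_batch_fetch` and `demo` are all made up for illustration, and it assumes `datasets` and `torch` are installed. It only measures how long `dataset[ids]` takes with and without `.with_format('torch')`; it does not fix anything.

```python
import random
import time


def time_batch_fetch(dataset, ids, repeats=5):
    """Return the best wall-clock time (seconds) of `dataset[ids]` over `repeats` runs.

    Works with any object that accepts a list of integer ids in `__getitem__`,
    which includes a Hugging Face `datasets.Dataset`.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        _ = dataset[ids]  # fetch the whole batch in a single call
        best = min(best, time.perf_counter() - start)
    return best


def demo(num_rows=100_000, batch_size=256):
    # Imported lazily so the helper above stays usable without this package.
    from datasets import Dataset

    # Hypothetical toy data; the real dataset from the post is not shown here.
    ds = Dataset.from_dict({"x": list(range(num_rows))})
    ids = random.sample(range(num_rows), batch_size)

    print("default format:", time_batch_fetch(ds, ids))
    print("torch format:  ", time_batch_fetch(ds.with_format("torch"), ids))
    # Sorted ids are included only as one more data point to compare against.
    print("sorted ids:    ", time_batch_fetch(ds.with_format("torch"), sorted(ids)))
```

Calling `demo()` prints three timings. Running it under a profiler (for example `python -m cProfile`) should show whether `ChunkedArray.to_numpy()` dominates on a toy dataset as well, though a small in-memory dataset like this may not behave the same as a large one loaded from disk.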
