Is `dataset.select(range(10000))` efficient?

NightMachinery · July 18, 2023, 12:45pm

Is `dataset.select(range(10000))` efficient?

Is this the best way to select a slice of the dataset?

mariosasko · July 18, 2023, 1:48pm

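Yes. When the indices form a contiguous, monotonically increasing range (such as `range(10000)`), `select` can slice the underlying PyArrow table directly instead of generating an indices mapping. An indices mapping adds a layer of indirection that makes later indexing slower.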
