Most efficient way to retrieve N rows for a subset of columns

valv · October 28, 2021, 10:52am

Hello,

I would like to retrieve rows from a dataset using a sequence of indexes as efficiently as possible. Each row contains many fields, so I would like to query the Arrow table for a subset of columns only, to take full advantage of the columnar format.

My current method is the following:

from typing import Iterable, List, Optional

from datasets import Dataset


def retrieve_rows(dataset: Dataset, indexes: Iterable[int], keys: Optional[List[str]]):
    """Retrieve n rows from the `dataset` for the specific keys."""
    if keys is not None and len(keys) == 1:
        key = keys[0]
        # dataset[key] loads the whole column before it is indexed
        retrieved_rows = map(dataset[key].__getitem__, indexes)
        retrieved_docs = [{key: x} for x in retrieved_rows]
    else:
        # dataset[i] reads every column of row i
        retrieved_rows = map(dataset.__getitem__, indexes)
        # filter keys
        retrieved_docs = [{k: v for k, v in row.items() if keys is None or k in keys} for row in retrieved_rows]
    return retrieved_docs

Limitations

However, it comes with two limitations:

for len(keys) == 1, the whole column is loaded;
for len(keys) > 1, all columns are queried, and the unwanted ones are only filtered out afterwards.

Dataset comes with a select method, but this creates a new Dataset object, which seems quite cumbersome for my use case.

Questions

So my questions are:
a. How can I query rows for a subset of columns only?
b. How can I batch the queries (or pass an iterator of indexes)?
c. Alternatively, is it possible to access the underlying Arrow table directly, so that I can fine-tune the queries myself?
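
To make question (b) concrete, this is roughly the kind of batching I have in mind: a plain-Python sketch that groups any iterable of indexes into fixed-size lists. The name `chunked` and the `batch_size` parameter are hypothetical, and nothing here depends on the datasets library.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(indexes: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Hypothetical helper: yield successive lists of at most `batch_size` indexes."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(indexes)
    while True:
        # Pull the next batch_size items without materialising the whole iterable.
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk
```

Each chunk could then be used as a single list index, e.g. `dataset[chunk]`, so that one query is issued per chunk rather than one per row.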

