Hello,
I would like to retrieve rows from a dataset using a sequence of indexes as efficiently as possible. Each row contains many fields, so I would like to query the Arrow table for only a subset of columns, to make the most of the columnar format.
My current method is the following:
    from typing import Iterable, List, Optional

    from datasets import Dataset


    def retrieve_rows(dataset: Dataset, indexes: Iterable[int], keys: Optional[List[str]] = None):
        """Retrieve rows from the `dataset` at `indexes` for the specific keys."""
        if keys is not None and len(keys) == 1:
            # Single key: access the column, then pick the requested rows from it.
            key = keys[0]
            retrieved_rows = map(dataset[key].__getitem__, indexes)
            retrieved_rows = [{key: x} for x in retrieved_rows]
        else:
            # Several keys (or none): fetch every full row by index.
            retrieved_rows = map(dataset.__getitem__, indexes)
        # filter keys
        retrieved_docs = [{k: v for k, v in row.items() if keys is None or k in keys} for row in retrieved_rows]
        return retrieved_docs
Limitations
However, it comes with two limitations:
- for len(keys)==1, the whole column is loaded.
- for len(keys)>1, all columns are queried.
Dataset comes with a select method, but this creates a new Dataset object, which seems quite cumbersome for my use case.
Questions
So my questions are:
a. How can I query rows for a subset of columns only?
b. How can I batch queries (or use an iterator of idx)?
c. Alternatively, is it possible to return the Arrow table directly, so I can fine-tune the queries?
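
To make question b more concrete, here is a minimal sketch of what I mean by batching the indexes. `batched_indexes` is a hypothetical helper of mine, not part of `datasets`; it only groups an iterable of indexes into fixed-size lists and does not touch the dataset itself.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched_indexes(indexes: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Yield successive lists of at most `batch_size` indexes (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(indexes)
    while True:
        # Pull the next chunk lazily, so generators are never fully materialized.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Works with any iterable, including a generator of indexes.
batches = list(batched_indexes((i * 2 for i in range(5)), batch_size=2))
```

Ideally each batch would then be resolved with a single lookup restricted to the requested columns (for instance by passing the list of indexes to the dataset at once, if that is supported), instead of one `__getitem__` call per index.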