Most efficient way to retrieve N rows for a subset of columns

Hello,

I would like to retrieve rows from a dataset using a sequence of indexes as efficiently as possible. Each row contains many fields, so I would like to query the Arrow table for a subset of columns in order to exploit at best the column format.

My current method is the following:

def retrieve_rows(dataset: Dataset, indexes: Iterable[int], keys:List[str]):
    """Retrieved n rows from the `dataset` for the specific keys."""
    if keys is not None and len(keys) == 1:
        key = keys[0]
        retrieved_rows = map(dataset[key].__getitem__, indexes)
        retrieved_rows = [{key: x} for x in retrieved_docs]
    else:
        retrieved_rows = map(dataset.__getitem__, indexes)
        # filter keys
        retrieved_docs = [{k: v for k, v in row.items() if keys is None or k in keys} for row in retrieved_rows]
   return retrieved_docs

Limitations

However, it comes with two limitations

  1. for len(keys)==1, the whole column is loaded.
  2. for len(keys)>1, all columns are queried.

Dataset comes with a select method, but this create a new Dataset object, which seems quite cumbersome for my use case.

Questions

So my questions are:
a. how to query rows for a subset of columns
b. how to batch queries (or using an iterator of idx)
c. or alternatively, is it possible to return the Arrow table directly, so I can fine-tune the queries?

Hi,

this is a bit more optimized version of your function:

def retrieve_rows(dataset: Dataset, indexes: Iterable[int], keys:List[str]):
    """Retrieved n rows from the `dataset` for the specific keys."""
    rows = [dataset[i] for i in indexes]
    return [{key: row[key] for key in keys} for row in rows]

Be careful with the dataset[key][index] call because this first loads the entire column into memory, which is OK if the dataset is small.

However, I’d suggest you to use select to select rows because it’s very cheap: it re-uses the underlying Arrow table and creates a file to store indices instead. To keep those indices in memory and not in the file, specify keep_in_memory=True in the select call.

Similarly, a subset of columns can be selected with:

dataset_col_subset = dataset.remove_columns(set(dataset.column_names) - set(keys)) 

If you want, you can access the underlying Arrow table with dataset._data.table.

Hi Mario,

Thank you very much for the detailed reply. I have tried different things, including using the pyarrow.Table.take method directly. Based on profiling, I have settled on:

class FetchRows:
    def __init__(self, dataset: Dataset, keys: List[str]):
        self.dataset = dataset.remove_columns(set(dataset.column_names) - set(keys))

    def __call__(self, indexes: List[int]) -> Dict[str, Any]:
        return self.dataset.select(indexes, keep_in_memory=True)[None:None]