Most efficient way to retrieve N rows for a subset of columns

valv · October 28, 2021, 10:52am

Hello,

I would like to retrieve rows from a dataset using a sequence of indexes as efficiently as possible. Each row contains many fields, so I would like to query the Arrow table for only a subset of columns, to make the most of the columnar format.

My current method is the following:

from typing import Iterable, List
from datasets import Dataset

def retrieve_rows(dataset: Dataset, indexes: Iterable[int], keys: List[str]):
    """Retrieve n rows from the `dataset` for the specific keys."""
    if keys is not None and len(keys) == 1:
        key = keys[0]
        # dataset[key] loads the whole column before indexing into it
        retrieved_rows = map(dataset[key].__getitem__, indexes)
        retrieved_docs = [{key: x} for x in retrieved_rows]
    else:
        retrieved_rows = map(dataset.__getitem__, indexes)
        # filter keys
        retrieved_docs = [{k: v for k, v in row.items() if keys is None or k in keys} for row in retrieved_rows]
    return retrieved_docs

Limitations

However, it comes with two limitations:

for len(keys)==1, the whole column is loaded.
for len(keys)>1, all columns are queried.

Dataset comes with a select method, but this creates a new Dataset object, which seems quite cumbersome for my use case.

Questions

So my questions are:
a. How can I query rows for a subset of columns?
b. How can I batch queries (or use an iterator of indexes)?
c. Alternatively, is it possible to return the Arrow table directly, so I can fine-tune the queries?

mariosasko · October 28, 2021, 11:01pm

Hi,

Here is a slightly more optimized version of your function:

def retrieve_rows(dataset: Dataset, indexes: Iterable[int], keys: List[str]):
    """Retrieve n rows from the `dataset` for the specific keys."""
    rows = [dataset[i] for i in indexes]
    return [{key: row[key] for key in keys} for row in rows]

Be careful with the dataset[key][index] call because it first loads the entire column into memory, which is only OK if the dataset is small.

However, I’d suggest using select to pick rows because it’s very cheap: it re-uses the underlying Arrow table and only creates a file to store the indices. To keep those indices in memory rather than in a file, specify keep_in_memory=True in the select call.

Similarly, a subset of columns can be selected with:

dataset_col_subset = dataset.remove_columns([c for c in dataset.column_names if c not in keys])

If you want, you can access the underlying Arrow table with dataset._data.table.
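To illustrate why select and a column subset are cheap, here is a toy sketch in plain Python. It is not the datasets implementation: the ColumnView class and its keep_columns and to_dict methods are hypothetical stand-ins. The idea it shows is the one described above: the column data is shared, and selecting rows or columns only changes a small list of indices or names.

```python
from typing import Any, Dict, List, Optional, Sequence


class ColumnView:
    """Toy stand-in for a column-oriented table with an indices mapping.

    Hypothetical class for illustration only; it is not part of `datasets`.
    """

    def __init__(self, columns: Dict[str, List[Any]],
                 indices: Optional[List[int]] = None,
                 names: Optional[List[str]] = None):
        self.columns = columns  # shared between views, never copied
        n_rows = len(next(iter(columns.values()))) if columns else 0
        self.indices = list(range(n_rows)) if indices is None else indices
        self.names = list(columns) if names is None else names

    def select(self, indexes: Sequence[int]) -> "ColumnView":
        # Compose the new indexes with the existing mapping; the data is untouched.
        return ColumnView(self.columns, [self.indices[i] for i in indexes], self.names)

    def keep_columns(self, keys: Sequence[str]) -> "ColumnView":
        # Restrict the visible columns; again only metadata changes.
        return ColumnView(self.columns, self.indices, [n for n in self.names if n in keys])

    def to_dict(self) -> Dict[str, List[Any]]:
        # Data is read here, and only for the selected columns and rows.
        return {n: [self.columns[n][i] for i in self.indices] for n in self.names}


table = ColumnView({"id": [0, 1, 2, 3], "text": ["a", "b", "c", "d"], "label": [1, 0, 1, 0]})
subset = table.keep_columns(["text", "label"]).select([3, 1])
print(subset.to_dict())  # {'text': ['d', 'b'], 'label': [0, 0]}
```

Nothing is read from the columns until to_dict is called, which is the reason creating a new view per query stays inexpensive in this toy model.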

valv · November 3, 2021, 10:48am

Hi Mario,

Thank you very much for the detailed reply. I have tried different things, including using the pyarrow.Table.take method directly. Based on profiling, I have settled on:

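from typing import Any, Dict, List
from datasets import Dataset

class FetchRows:
    def __init__(self, dataset: Dataset, keys: List[str]):
        # keep only the requested columns; done once, at construction time
        self.dataset = dataset.remove_columns([c for c in dataset.column_names if c not in keys])

    def __call__(self, indexes: List[int]) -> Dict[str, Any]:
        # select the rows (indices kept in memory), then read them all at once
        return self.dataset.select(indexes, keep_in_memory=True)[None:None]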
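In case the [None:None] indexing looks odd: it is just a full slice, the same as [:]. A minimal plain-Python check, using a hypothetical Probe class that returns whatever key it is indexed with:

```python
class Probe:
    """Hypothetical helper that returns the key it is indexed with."""

    def __getitem__(self, key):
        return key


probe = Probe()
full_slice = probe[None:None]
print(full_slice)               # slice(None, None, None)
print(full_slice == probe[:])   # True
```

With the default output format, slicing a Dataset this way should give back a dict mapping each column name to a list of values, so the call returns the selected rows in column form.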
