Index retrieval speed varies considerably with dataset size

vishalrao · April 23, 2022, 3:12am

The retrieval speed of rows using indices varies considerably with dataset size.
E.g.,

import numpy as np
import pandas as pd
from datasets import Dataset

dummy_df = pd.DataFrame([['a'*500, 'a'*5000, np.array(range(512))]] * 700000, columns=['col_a', 'col_b', 'col_c'])

dummy_ds = Dataset.from_pandas(dummy_df)
dummy_ds_2 = dummy_ds.remove_columns(['col_c'])

# Test-1: Retrieval on entire dataset
samples_1 = dummy_ds[range(100)]

# Test-2: Retrieval on dataset without col_c
samples_2 = dummy_ds_2[range(100)]

Test-1 takes 17.7 ms to run on my machine, while Test-2 takes only about 654 µs. Is there any way to retrieve records faster from datasets with heavy fields?

I also tried the select method. It takes around 1.8 ms on both datasets, which is about 3x slower than Test-2 (although much better than Test-1).

vishalrao · April 25, 2022, 7:55pm

This may also have a negative impact on methods like get_nearest_examples and get_nearest_examples_batch in search.IndexableMixin if a larger value of k (e.g., 100) is used. Also, select cannot be used here, since the method is not supported on indexed datasets.

lhoestq · May 9, 2022, 10:27am

Hi! For long arrays it is faster to read the data as NumPy arrays rather than as Python objects. Datasets are stored in the Arrow format, which allows zero-copy reads to NumPy.

To make your query run faster, you can do dummy_ds = dummy_ds.with_format("numpy").
