Index retrieval speed varies considerably with dataset size

The retrieval speed of rows using indices varies considerably with dataset size.

import numpy as np
import pandas as pd
from datasets import Dataset

dummy_df = pd.DataFrame([['a'*500, 'a'*5000, np.array(range(512))]] * 700000, columns=['col_a', 'col_b', 'col_c'])

dummy_ds = Dataset.from_pandas(dummy_df)
dummy_ds_2 = dummy_ds.remove_columns(['col_c'])

# Test-1: Retrieval on entire dataset
samples_1 = dummy_ds[range(100)]

# Test-2: Retrieval on dataset without col_c
samples_2 = dummy_ds_2[range(100)]

Test-1 takes 17.7 ms to run on my machine, while Test-2 just takes around 654 µs. Is there any way to retrieve records faster from datasets with heavy fields?

I also tried using the select method. It takes around 1.8 ms on both datasets, which is around 3x slower than Test-2 (although much faster than Test-1).

This may also have a negative impact on methods like get_nearest_examples and get_nearest_examples_batch in search.IndexableMixin if a larger value of k (e.g., 100) is used. select cannot be used as a workaround there, since the method is not supported on indexed datasets.

Hi! For long arrays, it is faster to read the data as NumPy arrays rather than as Python objects. Indeed, datasets are stored in the Arrow format, which allows zero-copy reads to NumPy.

To make your query run faster, you can do dummy_ds = dummy_ds.with_format("numpy").