The speed of retrieving rows by index varies considerably with how heavy the dataset's fields are. For example:
import numpy as np
import pandas as pd
from datasets import Dataset
dummy_df = pd.DataFrame([['a'*500, 'a'*5000, np.array(range(512))]] * 700000, columns=['col_a', 'col_b', 'col_c'])
dummy_ds = Dataset.from_pandas(dummy_df)
dummy_ds_2 = dummy_ds.remove_columns(['col_c'])
# Test-1: Retrieval on entire dataset
samples_1 = dummy_ds[range(100)]
# Test-2: Retrieval on the dataset without col_c
samples_2 = dummy_ds_2[range(100)]
Test-1 takes 17.7 ms on my machine, while Test-2 takes only around 654 µs. Is there any way to retrieve records faster from datasets with heavy fields?
I also tried the select method. It takes around 1.8 ms on both datasets, about 3x slower than Test-2 (though much better than Test-1).