Querying column is slow for datasets with indices mapping
I have a large dataset and want to query the values of a specific column for batch packing. It turns out that querying is very slow if I use a dataset with an indices mapping (e.g., after train_test_split or select operations).

The code below reproduces the issue:

import time
import numpy as np
from datasets import Dataset

input_ids = np.random.randint(0, 60_000, (500_000, 128)).tolist()
length = np.random.randint(3, 128, (500_000,)).tolist()

dataset = Dataset.from_dict({"input_ids": input_ids, "length": length})
dataset_dict = dataset.train_test_split(test_size=0.1)

# ---------------------------------------------------------------
# Original dataset
start = time.time()
_ = dataset["length"]
print(f"Operation took {time.time() - start:.2f} seconds")
# Operation took 0.15 seconds
# ---------------------------------------------------------------

# ---------------------------------------------------------------
# Dataset with indices mapping
start = time.time()
_ = dataset_dict["train"]["length"]
print(f"Operation took {time.time() - start:.2f} seconds")
# Operation took 5.74 seconds
# ---------------------------------------------------------------

It takes forever to load the column values for my 500+ GB dataset, so I have to call flatten_indices on each split to get deep copies that query quickly. But flatten_indices itself is slow and also takes too much time, which makes preprocessing very painful.

Is there a way to achieve acceptable performance without flattening indices?

Which version are you using? The latest versions give dramatic improvements on such querying operations.

I’m using the latest published version (1.6.2).

Hi! Since you used train_test_split, the new train dataset is shuffled. Querying a shuffled dataset takes time because the elements have to be re-ordered, so the bigger the dataset, the longer it takes to query all the data of a shuffled column.

If you want more speed, either don’t shuffle (pass shuffle=False to train_test_split), or use flatten_indices once and for all (it re-orders the rows correctly).
