Querying column is slow for datasets with indices mapping

Hi,

I have a large dataset and want to query the values of a specific column for batch packing. It turns out that querying is very slow if I use a dataset with an indices mapping (e.g., after train_test_split or select operations).

The code below reproduces the issue:

import time
import numpy as np
from datasets import Dataset

input_ids = np.random.randint(0, 60_000, (500_000, 128)).tolist()
length = np.random.randint(3, 128, (500_000)).tolist()

dataset = Dataset.from_dict({"input_ids": input_ids, "length": length})
dataset_dict = dataset.train_test_split(test_size=0.1)

# ---------------------------------------------------------------
# Original dataset
start = time.time()
_ = dataset["length"]
print(f"Operation took {time.time() - start:.2f} seconds")
# Operation took 0.15 seconds
# ---------------------------------------------------------------

# ---------------------------------------------------------------
# Dataset with indices mapping
start = time.time()
_ = dataset_dict["train"]["length"]
print(f"Operation took {time.time() - start:.2f} seconds")
# Operation took 5.74 seconds
# ---------------------------------------------------------------

It takes forever to load values for my 500+ GB dataset, so I have to call flatten_indices on each dataset split to get deep copies with fast querying performance. However, flatten_indices itself also takes too much time, so preprocessing remains painfully slow.

Is there a way to achieve acceptable performance without flattening indices?


Which version are you using? The latest versions give dramatic improvements on such querying operations.

I’m using the latest published version (1.6.2).

Hi! Since you used train_test_split, the new train dataset is shuffled. Querying a shuffled dataset takes some time because it has to re-order all the elements, so the bigger the dataset, the longer it takes to query all the data of a shuffled column.

If you want more speed, either don't shuffle (pass shuffle=False to train_test_split), or call flatten_indices once and for all (it re-orders the rows into a contiguous copy).
