Querying column is slow for datasets with indices mapping

mozharovsky · May 14, 2021, 11:39am

Hi,

I have a large dataset and want to query all values of a specific column for batch packing. It turns out that querying is very slow on datasets that have an indices mapping (e.g., after train_test_split or select operations).

The code below reproduces the issue:

import time
import numpy as np
from datasets import Dataset

input_ids = np.random.randint(0, 60_000, (500_000, 128)).tolist()
length = np.random.randint(3, 128, (500_000)).tolist()

dataset = Dataset.from_dict({"input_ids": input_ids, "length": length})
dataset_dict = dataset.train_test_split(test_size=0.1)

# ---------------------------------------------------------------
# Original dataset
start = time.time()
_ = dataset["length"]
print(f"Operation took {time.time() - start:.2f} seconds")
# Operation took 0.15 seconds
# ---------------------------------------------------------------

# ---------------------------------------------------------------
# Dataset with indices mapping
start = time.time()
_ = dataset_dict["train"]["length"]
print(f"Operation took {time.time() - start:.2f} seconds")
# Operation took 5.74 seconds
# ---------------------------------------------------------------

It takes forever to load values for my 500+ GB dataset, so I have to call flatten_indices on each dataset split to get deep copies with fast querying performance. However, flatten_indices is itself slow and takes too much time, so preprocessing becomes very painful.

Is there a way to achieve acceptable performance without flattening indices?

thomwolf · May 15, 2021, 4:27pm

Which version are you using? The latest versions give dramatic improvements on such querying operations.

mozharovsky · May 15, 2021, 8:43pm

I’m using the latest published version (1.6.2).

lhoestq · May 17, 2021, 8:56am

Hi! Since you used train_test_split, the new train dataset is shuffled. Querying a shuffled dataset takes some time because it has to re-order all the elements, so the bigger the dataset, the longer it takes to query all the data of a shuffled column.

If you want more speed, either don’t shuffle (shuffle=False in train_test_split) or use flatten_indices once and for all (it re-orders the rows correctly).
