Why is dataset iteration so slow?

Iterating over my dataset takes a long time. I don't understand why it's so slow (especially compared to a regular text file):

import tqdm
from datasets import load_dataset

# test.txt contains 3m lines of text
# Iterate it
with open("test.txt", "r") as f:
    for line in tqdm.tqdm(f):
        pass

# Create a dataset from the text file
dataset = load_dataset("text", data_files={"train": ["test.txt"]})["train"]
# Iterate it
for sample in tqdm.tqdm(dataset):
    pass

The output on my computer:

3027116it [00:00, 5663083.60it/s]
100%|█████████████████████████████████████| 3027116/3027116 [00:35<00:00, 84101.94it/s]

So more than 5M it/s when reading the raw text file, vs. 85k it/s when using datasets. Why?

To get the best performance with Arrow, you often have to read the data chunk by chunk. You can try:

for batch in tqdm.tqdm(dataset.iter(batch_size=100)):
    pass
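
For reference, here's a minimal sketch of what that batched loop can look like, assuming the dataset was built with the text loader as above (so each batch is a dict with a "text" column holding a list of lines); note that tqdm then counts batches rather than individual rows:

import tqdm
from datasets import load_dataset

dataset = load_dataset("text", data_files={"train": ["test.txt"]})["train"]

batch_size = 100
num_batches = (len(dataset) + batch_size - 1) // batch_size

# Each batch is a dict of columns; reading 100 rows at a time from Arrow
# avoids paying the per-row conversion cost of `for sample in dataset`.
for batch in tqdm.tqdm(dataset.iter(batch_size=batch_size), total=num_batches):
    lines = batch["text"]  # list of up to `batch_size` strings
    pass  # process the lines here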

Please also make sure to use the latest version of datasets (e.g. pip install --upgrade datasets); we recently removed some unoptimized code.
