Why is dataset iteration so slow?

Iterating over my dataset takes a long time. I don't understand why it's so slow (especially compared to a regular text file):

import tqdm
from datasets import load_dataset

# test.txt contains 3m lines of text
# Iterate it
with open("test.txt", "r") as f:
    for line in tqdm.tqdm(f):
        pass

# Create a dataset from the text file
dataset = load_dataset("text", data_files={"train": ["test.txt"]})["train"]
# Iterate it
for sample in tqdm.tqdm(dataset):
    pass

The output on my computer:

3027116it [00:00, 5663083.60it/s]
100%|█████████████████████████████████████| 3027116/3027116 [00:35<00:00, 84101.94it/s]

So more than 5M it/s when reading the raw text file, vs. 85k it/s when using datasets. Why?

To get the best performance with Arrow, you often have to read the data chunk by chunk. You can try:

for batch in tqdm.tqdm(dataset.iter(batch_size=100)):
    pass
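
For reference, here's a minimal sketch of what that batched loop can look like, assuming the dataset was built with the text loader as above (so each batch is a dict with a "text" column holding a list of lines); note that tqdm then counts batches rather than individual rows:

import tqdm
from datasets import load_dataset

dataset = load_dataset("text", data_files={"train": ["test.txt"]})["train"]

batch_size = 100
num_batches = (len(dataset) + batch_size - 1) // batch_size

# Each batch is a dict of columns; reading 100 rows at a time from Arrow
# avoids paying the per-row conversion cost of `for sample in dataset`.
for batch in tqdm.tqdm(dataset.iter(batch_size=batch_size), total=num_batches):
    lines = batch["text"]  # list of up to `batch_size` strings
    pass  # process the lines here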

Please also make sure to use the latest version of datasets (e.g. pip install --upgrade datasets); we recently removed some unoptimized code.
