Iterating over my dataset takes a long time, and I don't understand why it's so slow (especially compared to a regular text file):
import tqdm
from datasets import load_dataset

# test.txt contains 3M lines of text

# Iterate over the raw text file
with open("test.txt", "r") as f:
    for line in tqdm.tqdm(f):
        pass

# Create a dataset from the text file
dataset = load_dataset("text", data_files={"train": ["test.txt"]})["train"]

# Iterate over the dataset
for sample in tqdm.tqdm(dataset):
    pass
The output on my computer:
3027116it [00:00, 5663083.60it/s]
100%|█████████████████████████████████████| 3027116/3027116 [00:35<00:00, 84101.94it/s]
So that's more than 5M it/s using the raw text file, vs ~85k it/s using datasets. Why?
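In case it helps narrow this down, here is a sketch of iterating in batches instead of one example at a time, which should amortize any per-example overhead. This assumes a datasets version that provides Dataset.iter; the batch_size of 1000 is an arbitrary choice:

import tqdm
from datasets import load_dataset

dataset = load_dataset("text", data_files={"train": ["test.txt"]})["train"]

# Sketch: batched iteration to amortize per-example overhead.
# Assumes Dataset.iter is available; batch_size=1000 is arbitrary.
for batch in tqdm.tqdm(dataset.iter(batch_size=1000)):
    pass  # each batch is a dict of columns, e.g. {"text": [... 1000 strings ...]}

If batched iteration turns out to be much faster per line, that would suggest the cost is in the per-example conversion from the on-disk format into Python objects rather than in raw I/O.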