Slow Iteration speed (with and without keep_in_memory=True)

lhoestq · March 13, 2023, 4:02pm

When reading tokenized text from Arrow, the bottleneck is often the conversion of the tokenized Arrow data to Python lists.

It’s much faster to load them as torch tensors directly, since the data is then read from your disk using zero-copy:

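from datasets import load_from_disk
from torch.utils.data import DataLoader

dataset = load_from_disk("mydata", keep_in_memory=False)
dataset = dataset.with_format("torch")  # rows come back as torch tensors, not Python lists
loader = DataLoader(dataset, batch_size=1000, collate_fn=collate_fn)  # collate_fn is your own batching function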

Also make sure to use recent versions of datasets and torch (at least datasets 2.10 and torch 1.13 to get the best speed).
