Why is it so slow to access data through iteration with a Hugging Face dataset?

I have a large txt file with one sentence per line.

When I try to read this file with the code below, it takes 0.05 sec.

import time

with open(data_path, "r") as f:
    test_time = time.time()
    for i in range(100000):
        a = f.readline()

print("%.2f" % (time.time() - test_time))

However, when I try to do the same with a Hugging Face dataset, it takes 3.2 sec.

from datasets import load_dataset

dataset = load_dataset("text", data_files=[data_path])

test_time = time.time()
for i in range(100000):
    a = dataset["train"][i]["text"]
print("time = %.2f" % (time.time() - test_time))

Because I want to read a large amount of data, accessing the full dataset this way with Hugging Face datasets is very slow.

I would like to know why this happens.

Hi! This is what the “access through iteration” should look like to correctly compare the performance (your code currently benchmarks standard indexing, which can’t leverage the sequential access available when iterating over rows):

for i, ex in enumerate(dataset["train"]):
    if i + 1 == 100000:
        break

At the moment, Dataset.__iter__ calls Dataset.__getitem__, so the result should still be the same, but I think we can make this faster if we iterate over batches by default. @lhoestq WDYT? From what I remember, this issue has been reported several times, so I think we should address it.
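For reference, until that lands, a workaround on the user side is to index the dataset with slices instead of single rows; slicing returns a batch as a dict of lists, so the per-row overhead is paid once per batch (the batch size of 1000 below is just an illustrative choice):

import time
from datasets import load_dataset

dataset = load_dataset("text", data_files=[data_path])["train"]

test_time = time.time()
batch_size = 1000  # illustrative; any reasonably large value amortizes the overhead
for start in range(0, 100000, batch_size):
    # slicing returns a dict of columns, e.g. {"text": [... batch_size lines ...]}
    batch = dataset[start : start + batch_size]["text"]
print("%.2f" % (time.time() - test_time))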

Yup, I agree it will be much faster to access the data in small batches. We can consider using pa_table.to_batches() first, and then iterate over the elements one RecordBatch at a time.
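A minimal sketch of that idea on a standalone pyarrow table (the toy table below stands in for the dataset's underlying Arrow table; this is not the actual datasets internals):

import pyarrow as pa

pa_table = pa.table({"text": [f"sentence {i}" for i in range(100000)]})

# read the table one RecordBatch at a time instead of slicing it per row
for batch in pa_table.to_batches(max_chunksize=1024):
    texts = batch.to_pydict()["text"]  # one Python conversion per batch, not per row
    for text in texts:
        pass  # consume each row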

For further optimization, we could even re-split each RecordBatch so we aren't slowed down by batch.slice, which is linear in time with respect to the size of the RecordBatch.
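Purely as an illustration of that re-splitting (a hypothetical helper, not the actual implementation), each incoming RecordBatch could be cut into fixed-size sub-batches up front so that any later slicing only ever touches a short batch:

def split_record_batch(record_batch, max_rows=100):
    # pre-slice a large RecordBatch into small zero-copy sub-batches
    return [
        record_batch.slice(offset, max_rows)
        for offset in range(0, record_batch.num_rows, max_rows)
    ]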

Another idea would be to pass a sub-batch of more than one element to the formatter and then unbatch the data, to further reduce the back and forth between Python and C++ code and to benefit from batched C++ code for numpy/torch/tf formatting.
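A rough sketch of that sub-batching idea; iter_rows is a hypothetical helper and to_pydict stands in for the batched formatter call, so this is not the real datasets code:

import pyarrow as pa

def iter_rows(pa_table, subbatch_size=1000):
    # one Python<->C++ round trip per sub-batch instead of per row
    for record_batch in pa_table.to_reader(max_chunksize=subbatch_size):
        formatted = record_batch.to_pydict()  # stand-in for the batched formatter
        # unbatch: yield the formatted rows one by one
        for row_idx in range(record_batch.num_rows):
            yield {col: values[row_idx] for col, values in formatted.items()}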

EDIT: oh actually there is a pa_table.to_reader that returns an iterable, it could be simpler this way:

for record_batch in pa_table.to_reader(max_chunksize=1):
    pa_subtable = pa.Table.from_batches([record_batch])
    yield format_table(pa_subtable, key=0, formatter=formatter, ...)
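For completeness, a self-contained toy version of that loop; to_pylist replaces the format_table/formatter call from the datasets internals just to make the snippet runnable:

import pyarrow as pa

pa_table = pa.table({"text": ["a", "b", "c"]})

def iter_examples(pa_table):
    for record_batch in pa_table.to_reader(max_chunksize=1):
        pa_subtable = pa.Table.from_batches([record_batch])
        # stand-in for format_table(pa_subtable, key=0, formatter=formatter, ...)
        yield pa_subtable.to_pylist()[0]

for example in iter_examples(pa_table):
    print(example)  # {'text': 'a'}, then {'text': 'b'}, then {'text': 'c'}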