Why is it so slow to access data through iteration with a Hugging Face dataset?

I have a large txt file with one sentence per line.

When I try to read this file with the code below, it takes 0.05 sec.

import time

with open(data_path, "r") as f:
    test_time = time.time()
    for i in range(100000):
        a = f.readline()

print("%.2f" % (time.time() - test_time))

However, when I try to do the same with a Hugging Face dataset, it takes 3.2 sec.

from datasets import load_dataset

dataset = load_dataset("text", data_files=[data_path])

test_time = time.time()
for i in range(100000):
    a = dataset["train"][i]["text"]
print("time = %.2f" % (time.time() - test_time))

Because I want to read a large amount of data, accessing the full dataset this way with Hugging Face datasets is very slow.

I would like to know why this happens.

Hi! This is what the “access through iteration” should look like to correctly compare the performance (your code currently benchmarks standard indexing, which can’t leverage the sequential access available when iterating over rows):

for i, ex in enumerate(dataset["train"]):
    if i + 1 == 100000:
        break

At the moment, Dataset.__iter__ calls Dataset.__getitem__, so the result should still be the same, but I think we can make this faster if we iterate over batches by default. @lhoestq WDYT? From what I remember, this issue has been reported several times, so I think we should address it.
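For reference, until that lands, a workaround on the user side is to index the dataset with slices instead of single rows; slicing returns a batch as a dict of lists, so the per-row overhead is paid once per batch (the batch size of 1000 below is just an illustrative choice):

import time
from datasets import load_dataset

dataset = load_dataset("text", data_files=[data_path])["train"]

test_time = time.time()
batch_size = 1000  # illustrative; any reasonably large value amortizes the overhead
for start in range(0, 100000, batch_size):
    # slicing returns a dict of columns, e.g. {"text": [... batch_size lines ...]}
    batch = dataset[start : start + batch_size]["text"]
print("%.2f" % (time.time() - test_time))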

Yup, I agree it will be much faster to access the data in small batches. We can consider using pa_table.to_batches() first, and then iterate over the elements one RecordBatch at a time.
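A minimal sketch of that idea on a standalone pyarrow table (the toy table below stands in for the dataset's underlying Arrow table; this is not the actual datasets internals):

import pyarrow as pa

pa_table = pa.table({"text": [f"sentence {i}" for i in range(100000)]})

# read the table one RecordBatch at a time instead of slicing it per row
for batch in pa_table.to_batches(max_chunksize=1024):
    texts = batch.to_pydict()["text"]  # one Python conversion per batch, not per row
    for text in texts:
        pass  # consume each row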

For further optimization, we could even re-split each RecordBatch so we aren't slowed down by batch.slice, which is linear in time with respect to the size of the RecordBatch.
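Purely as an illustration of that re-splitting (a hypothetical helper, not the actual implementation), each incoming RecordBatch could be cut into fixed-size sub-batches up front so that any later slicing only ever touches a short batch:

def split_record_batch(record_batch, max_rows=100):
    # pre-slice a large RecordBatch into small zero-copy sub-batches
    return [
        record_batch.slice(offset, max_rows)
        for offset in range(0, record_batch.num_rows, max_rows)
    ]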

Another idea would be to pass a sub-batch of more than one element to the formatter and then unbatch the data, to further reduce the back and forth between Python and C++ code and to benefit from batched C++ code for numpy/torch/tf formatting.
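A rough sketch of that sub-batching idea; iter_rows is a hypothetical helper and to_pydict stands in for the batched formatter call, so this is not the real datasets code:

import pyarrow as pa

def iter_rows(pa_table, subbatch_size=1000):
    # one Python<->C++ round trip per sub-batch instead of per row
    for record_batch in pa_table.to_reader(max_chunksize=subbatch_size):
        formatted = record_batch.to_pydict()  # stand-in for the batched formatter
        # unbatch: yield the formatted rows one by one
        for row_idx in range(record_batch.num_rows):
            yield {col: values[row_idx] for col, values in formatted.items()}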

EDIT: oh actually there is a pa_table.to_reader that returns an iterable, it could be simpler this way:

for record_batch in pa_table.to_reader(max_chunksize=1):
    pa_subtable = pa.Table.from_batches([record_batch])
    yield format_table(pa_subtable, key=0, formatter=formatter, ...)
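For completeness, a self-contained toy version of that loop; to_pylist replaces the format_table/formatter call from the datasets internals just to make the snippet runnable:

import pyarrow as pa

pa_table = pa.table({"text": ["a", "b", "c"]})

def iter_examples(pa_table):
    for record_batch in pa_table.to_reader(max_chunksize=1):
        pa_subtable = pa.Table.from_batches([record_batch])
        # stand-in for format_table(pa_subtable, key=0, formatter=formatter, ...)
        yield pa_subtable.to_pylist()[0]

for example in iter_examples(pa_table):
    print(example)  # {'text': 'a'}, then {'text': 'b'}, then {'text': 'c'}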