Skipping in Streaming mode takes forever

Shaier · April 25, 2023, 9:03pm

I’m trying to use a large dataset (30M+ entries) in streaming mode while skipping the first 12M examples:

full_pubmed = load_dataset('pubmed', streaming=True)['train'].skip(12000000)

But the following code has been running for over 10 minutes:
list(full_pubmed.take(20))

What is the point of “skip” if it just iterates lines one by one (which is why I assume it takes so much time)? Unless I’m missing something and there’s a better way to skip entries?

lhoestq · April 26, 2023, 4:57pm

Hi! The original pubmed data consists of around 1k XML files, and unfortunately we can’t know in advance which file contains the example at position 12M, so it has to iterate over all the preceding examples before reaching it.
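To illustrate why this is linear in the number of skipped examples, here is a minimal pure-Python sketch. It does not use the datasets library: stream_examples and skip_then_take are hypothetical stand-ins for a streaming dataset and for skip-then-take, and the file and row counts are made up.

```python
from itertools import islice


def stream_examples(num_files, examples_per_file, parsed_counter):
    """Hypothetical stand-in for a streaming dataset backed by many files."""
    for file_idx in range(num_files):
        for row_idx in range(examples_per_file):
            # Every yielded example has to be parsed first, even if it is discarded.
            parsed_counter[0] += 1
            yield {"file": file_idx, "row": row_idx}


def skip_then_take(examples, n_skip, n_take):
    # Without known per-file lengths, the only way to reach position n_skip
    # is to consume and throw away the first n_skip examples.
    return list(islice(examples, n_skip, n_skip + n_take))


parsed = [0]
first = skip_then_take(stream_examples(10, 1000, parsed), n_skip=2500, n_take=3)
print(first)      # the three examples starting at position 2500
print(parsed[0])  # how many examples were parsed to get there
```

The counter shows that the cost is driven by n_skip, not by the handful of examples that are actually kept.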

At some point we’ll support fast skipping if the dataset is made of supported data files like parquet, for which we know the length in advance.

Shaier · April 28, 2023, 4:16am

Got it! Thanks a lot for the help

samsja · May 7, 2024, 9:56am

Hey, any progress on this topic? Thanks in advance.

lhoestq · May 13, 2024, 4:23pm

Not yet, though it would be cool to let the ExamplesIterable (see datasets/src/datasets/iterable_dataset.py at main in huggingface/datasets on GitHub) have a length that can be used to know how many of them should be skipped when calling .skip() with a high value.
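A rough sketch of that idea, assuming the number of examples in each shard is known up front: locate and shard_lengths are hypothetical names for illustration and are not part of the datasets API. Given the lengths, the shard that contains position n_skip can be found with a binary search, so whole shards are skipped without reading them.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(n_skip, shard_lengths):
    """Return (shard_index, offset_in_shard) for the example at position n_skip.

    shard_lengths is a hypothetical list of known per-shard example counts.
    """
    if n_skip < 0:
        raise ValueError("n_skip must be non-negative")
    # Cumulative end position (exclusive) of each shard.
    ends = list(accumulate(shard_lengths))
    if not ends or n_skip >= ends[-1]:
        raise IndexError("n_skip is past the end of the dataset")
    # First shard whose end position is strictly greater than n_skip.
    shard = bisect_right(ends, n_skip)
    start = ends[shard] - shard_lengths[shard]
    return shard, n_skip - start


# Made-up shard sizes, just to exercise the function.
shard, offset = locate(1200, [1000, 500, 2000])
print(shard, offset)
```

With this, only the examples before the offset inside the selected shard would still need to be iterated; everything in the earlier shards is skipped by arithmetic alone.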
