Skipping in streaming mode takes forever

I’m trying to use a large dataset (30M+ entries) in streaming mode, skipping the first 12M examples:

full_pubmed = load_dataset('pubmed', streaming=True)['train'].skip(12000000)

But the following code has been running for over 10 minutes:

list(full_pubmed.take(20))

What is the point of “skip” if it just iterates over the examples one by one (which I assume is why it takes so long)? Unless I’m missing something and there’s a better way to skip entries?

Hi ! The original pubmed data is made of around 1k XML files, and unfortunately we can’t know in advance where the example at position 12M is located, so it has to iterate over all the examples before finding it.

At some point we’ll support fast skipping when the dataset is made of supported data files like Parquet, for which we know the number of rows in advance.
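To illustrate the idea (this is a hypothetical sketch, not the actual `datasets` implementation): if the per-file row counts were known in advance (e.g. read from Parquet footers), locating the example at position 12M would just be a binary search over cumulative counts, with no data read at all. The `row_counts` values and the `locate` helper below are made up for the example:

```python
import bisect
from itertools import accumulate

# Hypothetical per-file row counts, e.g. read from Parquet file footers.
row_counts = [5_000_000, 4_000_000, 6_000_000, 20_000_000]

def locate(index, counts):
    """Return (file_idx, offset_within_file) for a global example index."""
    cumulative = list(accumulate(counts))  # [5M, 9M, 15M, 35M]
    file_idx = bisect.bisect_right(cumulative, index)
    prev = cumulative[file_idx - 1] if file_idx else 0
    return file_idx, index - prev

print(locate(12_000_000, row_counts))  # (2, 3000000): file 2, offset 3M
```

With that, `.skip(12_000_000)` could open file 2 directly and discard only 3M rows instead of iterating through 12M.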


Got it! Thanks a lot for the help

Hey, any progress on this topic? Thanks in advance

Not yet, though it would be cool to let the ExamplesIterable (see datasets/src/datasets/iterable_dataset.py at main · huggingface/datasets · GitHub) have a length that can be used to determine how many examples should be skipped when calling .skip() with a high value
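As a toy model of that idea (not the real `ExamplesIterable` API, just a sketch): if each underlying shard exposed its length, `skip()` could drop whole shards up front and only slice into the one boundary shard, instead of iterating example by example:

```python
class LengthAwareIterable:
    """Toy model: a dataset made of shards whose lengths are known in advance."""
    def __init__(self, shards):
        self.shards = shards  # list of lists, standing in for data files

    def skip(self, n):
        remaining = n
        kept = []
        for shard in self.shards:
            if remaining >= len(shard):
                remaining -= len(shard)       # drop the whole shard, no iteration
            elif remaining:
                kept.append(shard[remaining:])  # partial skip in the boundary shard
                remaining = 0
            else:
                kept.append(shard)
        return LengthAwareIterable(kept)

    def __iter__(self):
        for shard in self.shards:
            yield from shard

ds = LengthAwareIterable([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
print(list(ds.skip(4)))  # [4, 5, 6, 7, 8]
```

Here the first shard is skipped without touching any of its elements, which is the behavior a length-aware `.skip()` would give for free.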