[Question] Is there a `skip_lines` when using `datasets.load_dataset("csv", streaming=True, ...)`, like how torchdata supports it?

Is there a skip_lines option when using datasets.load_dataset("csv", streaming=True, ...), like the one torchdata’s DataPipes support?

CSVParser — TorchData main documentation, for example:

import pytorch_lightning as pl
from torchdata.datapipes.iter import IterDataPipe, IterableWrapper

dp_chained_datapipe: IterDataPipe = (
    ...  # earlier steps of the chain are omitted here
    .parse_csv_as_dict(skip_lines=self.skip_lines, delimiter='\t')
)

Pandas supports skiprows (see pandas.read_csv — pandas 1.5.1 documentation). Going by that and by datasets/dataset_dict.py at 3c1981239ce4ddac4774032948a11b00ec6fb3da · huggingface/datasets · GitHub, does that mean we can essentially do the following to skip rows?

ds = load_dataset("csv", ..., streaming=True, skiprows=10)

or

ds = load_dataset("csv", ..., streaming=True, skiprows=[0, 100, 203, 423, 204])

Hi! Yes, that’s exactly how you can use the skiprows parameter with the CSV builder.

Thank you for the confirmation!