[Question] Is there a `skip_lines` when using `datasets.load_dataset("csv", streaming=True, ...)`, like how torchdata supports it?

alvations · November 2, 2022, 10:58pm

Is there a skip_lines option when using datasets.load_dataset("csv", streaming=True, ...), like the one torchdata's DataPipes support?

See CSVParser in the TorchData main documentation, for example:

import pytorch_lightning as pl
from torchdata.datapipes.iter import IterDataPipe, IterableWrapper

# Taken from inside a class that defines self.csv_files and self.skip_lines.
dp_chained_datapipe: IterDataPipe = (
    IterableWrapper(iterable=self.csv_files)
    .open_files()
    .parse_csv_as_dict(skip_lines=self.skip_lines, delimiter='\t')
)

Pandas supports skiprows (see pandas.read_csv in the pandas 1.5.1 documentation), and datasets seems to rely on pandas for CSV files (see datasets/dataset_dict.py at 3c1981239ce4ddac4774032948a11b00ec6fb3da · huggingface/datasets · GitHub).

Does that mean that we can essentially do as below to skip the rows?

ds = load_dataset("csv", ..., streaming=True, skiprows=10)

or

ds = load_dataset("csv", ..., streaming=True, skiprows=[0, 100, 203, 423, 204])

mariosasko · November 3, 2022, 3:29pm

Hi! Yes, that’s exactly how you can use the skiprows parameter with the CSV builder.
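
For anyone unsure how the two forms of skiprows differ, here is a minimal sketch using pandas.read_csv directly, on the assumption that the CSV builder forwards skiprows to pandas unchanged. The in-memory tab-separated data and the variable names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical tab-separated data: one header line followed by five rows.
raw = "id\ttext\n" + "".join(f"{i}\trow{i}\n" for i in range(5))

# An integer skips that many physical lines from the top of the file,
# so the first line that survives is parsed as the header.
df_int = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=2)

# A list skips specific 0-indexed line numbers. Line 0 is the header line,
# so lines 1 and 3 here are the data rows with id 0 and id 2.
df_list = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=[1, 3])

print(list(df_int.columns))    # header taken from the third physical line
print(df_list["id"].tolist())  # ids of the rows that were kept
```

Note that an integer skiprows also consumes the original header line, so passing a list that leaves line 0 alone is probably the safer choice when the column names need to be preserved.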

alvations · November 3, 2022, 3:51pm

Thank you for the confirmation!
