Is there a skip_lines option when using datasets.load_dataset("csv", streaming=True, ...), like the one torchdata's DataPipes support?
See CSVParser in the TorchData main documentation, for example:
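import pytorch_lightning as pl
from torchdata.datapipes.iter import IterDataPipe, IterableWrapper

# Excerpt from a class method: self.csv_files holds the CSV paths,
# self.skip_lines is the number of leading lines to drop per file.
dp_chained_datapipe: IterDataPipe = (
    IterableWrapper(iterable=self.csv_files)
    .open_files()
    .parse_csv_as_dict(skip_lines=self.skip_lines, delimiter='\t')
)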
Pandas supports skiprows (see pandas.read_csv in the pandas 1.5.1 documentation), and datasets appears to go through pandas when reading CSV files (see datasets/dataset_dict.py at 3c1981239ce4ddac4774032948a11b00ec6fb3da · huggingface/datasets · GitHub).
Does that mean we can skip rows like this?
ds = load_dataset("csv", ..., streaming=True, skiprows=10)
or
ds = load_dataset("csv", ..., streaming=True, skiprows=[0, 100, 203, 423, 204])
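
To make the behaviour I'm after concrete, here is a minimal stdlib-only sketch of the two skiprows forms as I understand them from the pandas docs: an int drops that many leading lines, and a list drops the given 0-indexed line numbers. The helper name iter_csv_rows is hypothetical and is not part of datasets, pandas, or torchdata. It also assumes one record per physical line, so quoted fields containing newlines would not be handled.

```python
import csv
import io
from typing import Iterable, Iterator, Union


def iter_csv_rows(
    lines: Iterable[str],
    skiprows: Union[int, Iterable[int]] = 0,
    delimiter: str = ",",
) -> Iterator[dict]:
    """Lazily yield CSV rows as dicts, dropping lines before parsing.

    Hypothetical helper for illustration only.
    skiprows as int:  drop that many lines from the start.
    skiprows as list: drop the lines at those 0-indexed positions.
    """
    if isinstance(skiprows, int):
        kept = (line for i, line in enumerate(lines) if i >= skiprows)
    else:
        to_skip = set(skiprows)
        kept = (line for i, line in enumerate(lines) if i not in to_skip)
    # The first surviving line is treated as the header row.
    yield from csv.DictReader(kept, delimiter=delimiter)


# Tab-separated sample with one junk line before the header.
sample = io.StringIO("# comment\nid\ttext\n1\tfoo\n2\tbar\n3\tbaz\n")

# Integer form: skip the first line only.
rows_int = list(iter_csv_rows(sample, skiprows=1, delimiter="\t"))

# List form: skip line 0 (the junk line) and line 3 (the "2\tbar" record).
sample.seek(0)
rows_list = list(iter_csv_rows(sample, skiprows=[0, 3], delimiter="\t"))
```

Because everything is a generator, nothing is read ahead of what the consumer asks for, which is the property I'd want to keep in streaming mode.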