Hi!
I’m looking to stream a dataset where for each __getitem__ call I need the next row and n previous rows. I couldn’t find a way to do this by googling.
Help appreciated!
Thanks!
To stream a dataset where each __getitem__ call returns the current row plus the n previous rows, you can create a custom PyTorch Dataset class (or apply the same logic in another framework). Here’s a minimal, streaming-compatible pattern that avoids loading everything into memory.
⸻
Goal:
For a given index i, return:
[rows[i - n], …, rows[i - 1], rows[i]]
⸻
Key Design Elements:
• Use a buffer (e.g. a collections.deque) to hold the most recent n rows when reading purely sequentially (see the sketch right after this list).
• Stream data line by line (e.g. from disk or a file-like iterator).
• Keep track of line start positions so you can seek back to earlier rows (if file-based).
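Aside: if you only ever read the file front to back (pure streaming, no random access), a deque with maxlen=n+1 is the entire buffer and no offset index is needed. Here is a minimal sketch using torch’s IterableDataset; the file path and window size are placeholders:

from collections import deque
from torch.utils.data import IterableDataset

class SlidingWindowStream(IterableDataset):
    def __init__(self, filepath, n_previous):
        self.filepath = filepath
        self.n = n_previous

    def __iter__(self):
        # deque(maxlen=n+1) automatically drops the oldest row once the
        # window is full, so memory use stays O(n) regardless of file size.
        window = deque(maxlen=self.n + 1)
        with open(self.filepath, 'r') as f:
            for line in f:
                window.append(line.strip())
                if len(window) == self.n + 1:
                    yield list(window)

Each yielded item is [row i-n, …, row i]; you iterate over the dataset rather than indexing into it.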
⸻
Example: Custom PyTorch Dataset (streaming from file)
from torch.utils.data import Dataset


class ContextWindowDataset(Dataset):
    def __init__(self, filepath, n_previous):
        self.filepath = filepath
        self.n = n_previous
        self.line_offsets = []
        # Precompute the byte offset at which each line starts.
        # Binary mode makes len(line) a byte count, so seek()
        # lands exactly at the start of each line.
        with open(filepath, 'rb') as f:
            offset = 0
            for line in f:
                self.line_offsets.append(offset)
                offset += len(line)
        self.total_lines = len(self.line_offsets)

    def __len__(self):
        return self.total_lines

    def __getitem__(self, index):
        if index < self.n:
            raise IndexError(f"Index {index} too small for {self.n} previous rows")
        result = []
        with open(self.filepath, 'rb') as f:
            # Seek to each line in the window and read it back.
            for i in range(index - self.n, index + 1):
                f.seek(self.line_offsets[i])
                line = f.readline().decode('utf-8').strip()
                result.append(line)
        return result
⸻
Example Usage:
ds = ContextWindowDataset("data.csv", n_previous=3)
print(ds[5])  # Returns [row 2, row 3, row 4, row 5]
⸻
Notes:
• This works with streamed data from disk.
• It avoids loading the full dataset into RAM.
• If your dataset is huge and accessed linearly, you can wrap this in a DataLoader (keep shuffling off and num_workers=0 to stay strictly sequential), as sketched below.
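A hedged sketch of that wrapping ("data.csv", the batch size, and n_previous=3 are placeholders). The sampler starts at index n, since smaller indices would raise IndexError:

from torch.utils.data import DataLoader

ds = ContextWindowDataset("data.csv", n_previous=3)
# No shuffling keeps access sequential; sampler=range(n, len(ds))
# skips the indices that don't have n rows of history yet.
loader = DataLoader(ds, batch_size=8, sampler=range(ds.n, len(ds)))
for batch in loader:
    # Default collation transposes: batch is a list of n+1 entries,
    # one per window position, each holding batch_size strings.
    pass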
⸻
LMK if you need help
Yes, it is possible to get multiple rows at once via streaming, depending on the streaming protocol or technology you’re using. For instance, in databases or data streaming platforms like Kafka, you can fetch multiple rows in batches or chunks as part of a continuous stream. This can be achieved by setting batch sizes or using cursors that allow you to retrieve data in bulk, without waiting for each individual row. The exact approach depends on the specific streaming API or database you’re working with.
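As an illustration with the kafka-python client (a hedged sketch; the topic name, server address, and record limits are placeholders, and process() is a hypothetical handler):

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "my-topic",                          # placeholder topic
    bootstrap_servers="localhost:9092",  # placeholder broker address
    auto_offset_reset="earliest",
)

# poll() fetches up to max_records messages per call instead of one
# row at a time; results come back grouped by topic-partition.
batch = consumer.poll(timeout_ms=1000, max_records=500)
for tp, records in batch.items():
    for record in records:
        process(record.value)  # hypothetical per-message handler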