Is it possible to get multiple rows at once via Streaming?

Hi!

I’m looking to stream a dataset where, for each __getitem__ call, I need the next row and the n previous rows. I couldn’t find a way to do this by googling.

Help appreciated!

Thanks!

To stream a dataset where each __getitem__ call returns the current row and the n previous rows, you can create a custom PyTorch Dataset class (or apply the same logic in another framework). Here’s a minimal, streaming-friendly pattern that avoids loading everything into memory.

:white_check_mark: Goal:

For a given index i, return:

[rows[i - n], …, rows[i - 1], rows[i]]

:puzzle_piece: Key Design Elements:
• Use a buffer to store the most recent n rows (for purely sequential access; see the deque sketch after this list).
• Stream data line by line (e.g. from disk or a file-like iterator).
• Keep track of line positions for seek-style recall (if file-based).
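If you only ever read the data front to back (true streaming, no random access), a sliding-window buffer is simpler than offset indexing. Here’s a minimal sketch using collections.deque and an IterableDataset; the class name and plain-text file format are illustrative assumptions:

from collections import deque
from torch.utils.data import IterableDataset

class SlidingWindowStream(IterableDataset):
    def __init__(self, filepath, n_previous):
        self.filepath = filepath
        self.n = n_previous

    def __iter__(self):
        # deque(maxlen=n + 1) automatically drops the oldest row, so at
        # most n previous rows plus the current one are ever in memory
        window = deque(maxlen=self.n + 1)
        with open(self.filepath, 'r') as f:
            for line in f:
                window.append(line.strip())
                # Start yielding once the window is full
                if len(window) == self.n + 1:
                    yield list(window)

Each yielded item is [rows[i - n], …, rows[i]] in order, and only n + 1 rows live in memory at any time.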

:wrench: Example: Custom PyTorch Dataset (streaming from file)

from torch.utils.data import Dataset

class ContextWindowDataset(Dataset):
    def __init__(self, filepath, n_previous):
        self.filepath = filepath
        self.n = n_previous
        self.line_offsets = []

        # Precompute line start byte offsets (binary mode keeps the
        # offsets byte-accurate even with multi-byte characters)
        with open(filepath, 'rb') as f:
            offset = 0
            for line in f:
                self.line_offsets.append(offset)
                offset += len(line)

        self.total_lines = len(self.line_offsets)

    def __len__(self):
        return self.total_lines

    def __getitem__(self, index):
        if index < self.n:
            raise IndexError(f"Index {index} too small for {self.n} previous rows")

        result = []
        with open(self.filepath, 'rb') as f:
            for i in range(index - self.n, index + 1):
                f.seek(self.line_offsets[i])
                line = f.readline().decode('utf-8').strip()
                result.append(line)
        return result

:test_tube: Example Usage:

ds = ContextWindowDataset("data.csv", n_previous=3)

print(ds[5]) # Returns [row 2, row 3, row 4, row 5]

:brain: Notes:
• This works with streamed data from disk.
• It avoids loading the full dataset into RAM.
• If your dataset is huge and accessed linearly, you can wrap this in a DataLoader with num_workers=0 so reads stay sequential, as sketched below.
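A minimal sketch of that wrapping (the batch_size is arbitrary; the sampler is a plain range that skips the first n indices, which would otherwise hit the IndexError above, and the identity collate_fn keeps each window together as a list of strings):

from torch.utils.data import DataLoader

ds = ContextWindowDataset("data.csv", n_previous=3)

# A plain range works as a sequential sampler; starting at ds.n
# skips the indices that lack enough previous rows
loader = DataLoader(
    ds,
    batch_size=8,                    # illustrative value
    sampler=range(ds.n, len(ds)),
    num_workers=0,                   # single process keeps reads sequential
    collate_fn=lambda batch: batch,  # each batch is a list of windows
)

for batch in loader:
    # batch is a list of up to 8 windows, each [row i-3, ..., row i]
    pass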

LMK if you need help
