Hi!
I’m looking to stream a dataset where for each __getitem__ call I need the next row and n previous rows. I couldn’t find a way to do this by googling.
Help appreciated!
Thanks!
To stream a dataset where each __getitem__ call returns the current row plus the n previous rows, you can create a custom PyTorch Dataset class (or apply the same logic in another framework). Here’s a minimal, streaming-compatible pattern that avoids loading everything into memory.
⸻
Goal:
For a given index i, return:
[rows[i - n], …, rows[i - 1], rows[i]]
⸻
Key Design Elements:
• Use a buffer (e.g. a collections.deque) to hold the most recent n rows when reading purely sequentially (see the sketch right after this list).
• Stream data line by line (e.g. from disk or a file-like iterator).
• Keep track of line start positions so you can seek back to earlier rows (if file-based).
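Aside: if you only ever read the file front to back (pure streaming, no random access), a deque with maxlen=n+1 is the entire buffer and no offset index is needed. Here is a minimal sketch using torch’s IterableDataset; the file path and window size are placeholders:

from collections import deque
from torch.utils.data import IterableDataset

class SlidingWindowStream(IterableDataset):
    def __init__(self, filepath, n_previous):
        self.filepath = filepath
        self.n = n_previous

    def __iter__(self):
        # deque(maxlen=n+1) automatically drops the oldest row once the
        # window is full, so memory use stays O(n) regardless of file size.
        window = deque(maxlen=self.n + 1)
        with open(self.filepath, 'r') as f:
            for line in f:
                window.append(line.strip())
                if len(window) == self.n + 1:
                    yield list(window)

Each yielded item is [row i-n, …, row i]; you iterate over the dataset rather than indexing into it.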
⸻
Example: Custom PyTorch Dataset (streaming from file)
from torch.utils.data import Dataset


class ContextWindowDataset(Dataset):
    def __init__(self, filepath, n_previous):
        self.filepath = filepath
        self.n = n_previous
        self.line_offsets = []
        # Precompute the byte offset at which each line starts.
        # Binary mode makes len(line) a byte count, so seek()
        # lands exactly at the start of each line.
        with open(filepath, 'rb') as f:
            offset = 0
            for line in f:
                self.line_offsets.append(offset)
                offset += len(line)
        self.total_lines = len(self.line_offsets)

    def __len__(self):
        return self.total_lines

    def __getitem__(self, index):
        if index < self.n:
            raise IndexError(f"Index {index} too small for {self.n} previous rows")
        result = []
        with open(self.filepath, 'rb') as f:
            # Seek to each line in the window and read it back.
            for i in range(index - self.n, index + 1):
                f.seek(self.line_offsets[i])
                line = f.readline().decode('utf-8').strip()
                result.append(line)
        return result
⸻
Example Usage:
ds = ContextWindowDataset("data.csv", n_previous=3)
print(ds[5])  # Returns [row 2, row 3, row 4, row 5]
⸻
Notes:
• This works with streamed data from disk.
• It avoids loading the full dataset into RAM.
• If your dataset is huge and accessed linearly, you can wrap this in a DataLoader (keep shuffling off and num_workers=0 to stay strictly sequential), as sketched below.
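A hedged sketch of that wrapping ("data.csv", the batch size, and n_previous=3 are placeholders). The sampler starts at index n, since smaller indices would raise IndexError:

from torch.utils.data import DataLoader

ds = ContextWindowDataset("data.csv", n_previous=3)
# No shuffling keeps access sequential; sampler=range(n, len(ds))
# skips the indices that don't have n rows of history yet.
loader = DataLoader(ds, batch_size=8, sampler=range(ds.n, len(ds)))
for batch in loader:
    # Default collation transposes: batch is a list of n+1 entries,
    # one per window position, each holding batch_size strings.
    pass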
⸻
LMK if you need help
Yes, it is possible to get multiple rows at once via streaming, depending on the streaming protocol or technology you’re using. For instance, in databases or data streaming platforms like Kafka, you can fetch multiple rows in batches or chunks as part of a continuous stream. This can be achieved by setting batch sizes or using cursors that allow you to retrieve data in bulk, without waiting for each individual row. The exact approach depends on the specific streaming API or database you’re working with.
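As an illustration with the kafka-python client (a hedged sketch; the topic name, server address, and record limits are placeholders, and process() is a hypothetical handler):

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "my-topic",                          # placeholder topic
    bootstrap_servers="localhost:9092",  # placeholder broker address
    auto_offset_reset="earliest",
)

# poll() fetches up to max_records messages per call instead of one
# row at a time; results come back grouped by topic-partition.
batch = consumer.poll(timeout_ms=1000, max_records=500)
for tp, records in batch.items():
    for record in records:
        process(record.value)  # hypothetical per-message handler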