Hi!
I’m looking to stream a dataset where for each getitem call I need the next row and n previous rows. I couldn’t find a way to do this by googling.
Help appreciated!
Thanks!
To have each getitem call return the current row plus the n previous rows, you can write a custom PyTorch Dataset class (the same logic carries over to other frameworks). The pattern below records where each line starts once, then reads only the n + 1 lines it needs per call, so the file is never loaded into memory in full.
⸻
Goal:
For a given index i, return:
[rows[i - n], …, rows[i - 1], rows[i]]
⸻
Key Design Elements:
• For purely sequential access, keep a buffer holding the most recent n rows.
• Stream data line by line (e.g. from disk or a file-like iterator).
• For random access into a file, record each line's starting byte offset so a row can be recalled with seek.
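The buffer idea from the first bullet can be sketched in plain Python as below. This is a minimal illustration, not part of the class that follows: the name sliding_windows is made up for this example, and it assumes rows arrive in order from any iterable (a file object, a generator, a streaming dataset).

```python
from collections import deque

def sliding_windows(rows, n):
    """Yield [rows[i - n], ..., rows[i]] for every i >= n.

    `rows` can be any iterable; only n + 1 rows are held in memory at a time.
    (Hypothetical helper name, used for illustration only.)
    """
    buffer = deque(maxlen=n + 1)  # oldest row is dropped automatically
    for row in rows:
        buffer.append(row)
        if len(buffer) == n + 1:  # skip the first n rows, which lack full context
            yield list(buffer)

# Usage: an iterator stands in for a streamed source
windows = list(sliding_windows(iter(["r0", "r1", "r2", "r3", "r4"]), n=2))
print(windows)
```

If you only ever read front to back, a generator like this could be wrapped in a torch.utils.data.IterableDataset instead of a map-style Dataset, which removes the need for an offset index.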
⸻
Example: Custom PyTorch Dataset (streaming from file)
from torch.utils.data import Dataset

class ContextWindowDataset(Dataset):
    def __init__(self, filepath, n_previous):
        self.filepath = filepath
        self.n = n_previous
        self.line_offsets = []
        # Precompute the byte offset at which each line starts.
        # Binary mode keeps len(line) in bytes, so the offsets are valid for seek()
        # even with non-ASCII text or \r\n line endings.
        with open(filepath, 'rb') as f:
            offset = 0
            for line in f:
                self.line_offsets.append(offset)
                offset += len(line)
        self.total_lines = len(self.line_offsets)

    def __len__(self):
        return self.total_lines

    def __getitem__(self, index):
        if index < self.n:
            raise IndexError(f"Index {index} too small for {self.n} previous rows")
        if index >= self.total_lines:
            raise IndexError(f"Index {index} out of range for {self.total_lines} rows")
        with open(self.filepath, 'rb') as f:
            # The window is contiguous, so seek once and read n + 1 lines.
            f.seek(self.line_offsets[index - self.n])
            # Assumes the file is UTF-8 encoded.
            return [f.readline().decode('utf-8').strip() for _ in range(self.n + 1)]
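Offsets passed to seek need to be byte positions, which is only guaranteed when the file is opened in binary mode; counting characters in text mode drifts as soon as a row contains a multi-byte character. Here is a torch-free sketch you can run to check the bookkeeping. The helper names line_start_offsets and read_window are made up for this example, and the rows are invented sample data, assumed to be UTF-8.

```python
import os
import tempfile

def line_start_offsets(path):
    """Return the byte offset at which each line of `path` starts."""
    offsets = []
    offset = 0
    with open(path, "rb") as f:  # binary mode: len(raw) is a byte count
        for raw in f:
            offsets.append(offset)
            offset += len(raw)
    return offsets

def read_window(path, offsets, index, n):
    """Return rows index - n .. index (inclusive) as decoded strings."""
    if not n <= index < len(offsets):
        raise IndexError(f"Index {index} has no full window of {n} previous rows")
    with open(path, "rb") as f:
        f.seek(offsets[index - n])  # rows are contiguous, so one seek is enough
        return [f.readline().decode("utf-8").rstrip("\r\n") for _ in range(n + 1)]

# Write a small sample file containing non-ASCII characters.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8", newline=""
) as tmp:
    tmp.write("id,name\n1,Zoë\n2,Łukasz\n3,Ana\n")
    demo_path = tmp.name

offsets = line_start_offsets(demo_path)
window = read_window(demo_path, offsets, index=3, n=2)
os.remove(demo_path)

print(offsets)
print(window)
```

Counting characters instead of bytes would put the third and fourth offsets one and two positions too early here, because "ë" and "Ł" each take two bytes in UTF-8.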
⸻
Example Usage:
ds = ContextWindowDataset("data.csv", n_previous=3)
print(ds[5]) # Returns [row 2, row 3, row 4, row 5]
⸻
Notes:
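• Rows are read from disk on demand, n + 1 lines per call.
• Only the offset index (one integer per line) is kept in RAM, not the dataset itself.
• Indices below n raise IndexError, so iterating from index 0 will fail. One way around this is to wrap the dataset in torch.utils.data.Subset(ds, range(n, len(ds))) before passing it to a DataLoader.
• If your dataset is huge and accessed linearly, use a DataLoader with shuffle=False (the default) so rows come back in order; num_workers=0 keeps all loading in the main process.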
⸻
LMK if you need help