I tried using load_dataset with streaming=True for the tiiuae/falcon-refinedweb dataset. However, I am unsure how I should iterate through the data.
I called iter() and then next(), but I get a StopIteration error. TIA!
Can you share the whole code block with terminal output?
With streaming=True and no split argument, load_dataset returns an IterableDatasetDict: a dictionary-like object that maps split names to the corresponding IterableDataset objects. Those per-split objects are what iterate over the data, so you must select a split first and then iterate over the returned dataset:
from datasets import load_dataset
ds = load_dataset("tiiuae/falcon-refinedweb", streaming=True)
# iter/next
it = iter(ds["train"])
next(it)
# or for loop
for ex in ds["train"]:
    pass
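A likely explanation for the StopIteration, sketched below without any download: iterating a dictionary-like object yields its keys (here, split names) rather than examples, so with a single split the second next() has nothing left. The names `fake_stream` and the plain `dict` are hypothetical stand-ins for a streamed split and an IterableDatasetDict, not part of the datasets library.

```python
from itertools import islice


def fake_stream():
    """Hypothetical stand-in for a streamed split; yields examples lazily."""
    for i in range(1000):
        yield {"content": f"document {i}"}


# Plain dict standing in for an IterableDatasetDict with one "train" split.
ds = {"train": fake_stream()}

# Iterating the dict itself walks over split names, not examples.
it = iter(ds)
first = next(it)  # the split name, not a data example
try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True  # only one split, so the second next() fails

# Selecting the split first gives an iterator over the examples.
# islice stops after 3 items, so the rest of the stream is never read.
head = list(islice(ds["train"], 3))
print(first, exhausted, head)
```

With the real dataset, the same `islice(ds["train"], 3)` pattern should let you peek at a few examples without consuming the whole stream; `ds["train"].take(3)` is another option.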