Calling load_dataset('wikipedia', '20220301.en', streaming=True) throws a ValueError:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-4-af223b124995> in <cell line: 1>()
----> 1 load_dataset('wikipedia', lang, streaming=True)
1 frames
/usr/local/lib/python3.9/dist-packages/datasets/builder.py in as_streaming_dataset(self, split, base_path)
1247 ) -> Union[Dict[str, IterableDataset], IterableDataset]:
1248 if not isinstance(self, (GeneratorBasedBuilder, ArrowBasedBuilder)):
-> 1249 raise ValueError(f"Builder {self.name} is not streamable.")
1250
1251 is_local = not is_remote_filesystem(self._fs)
ValueError: Builder wikipedia is not streamable.
As a workaround, the following works, though it feels a little meta (fourth-wall-breaking):
from datasets import load_dataset, IterableDataset
from torch.utils.data import DataLoader

# Load from HF (non-streaming, so the full dataset is downloaded and prepared).
_ds = load_dataset('wikipedia', '20220301.en')

def _ds_gen():
    # _ds is a DatasetDict, so index the 'train' split before taking its length.
    for i in range(len(_ds['train'])):
        yield _ds['train'][i]

dataloader = DataLoader(
    IterableDataset.from_generator(_ds_gen)
)
Is there any way to build a generator wrapper like the one above into load_dataset('wikipedia', '20220301.en', streaming=True) itself?
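For what it's worth, here is a rough sketch of the kind of wrapper I have in mind. It is not part of the datasets library: iter_examples is a hypothetical helper name, and the loader is passed in as load_fn so the same pattern works with load_dataset or a stand-in. It assumes the failure shows up as a ValueError whose message contains "not streamable", as in the traceback above. It tries streaming=True first and falls back to a regular load only for that error.

```python
from typing import Any, Callable, Iterator


def iter_examples(
    load_fn: Callable[..., Any],
    *args: Any,
    split: str = "train",
    **kwargs: Any,
) -> Iterator[Any]:
    """Yield examples from `split`, streaming when the builder allows it.

    `load_fn` is assumed to behave like `datasets.load_dataset`: it accepts
    a `streaming` keyword and returns a mapping from split name to an
    iterable of examples.
    """
    try:
        # Preferred path: ask the loader for a streaming dataset.
        ds = load_fn(*args, streaming=True, **kwargs)
    except ValueError as err:
        # Only handle the "not streamable" case; re-raise anything else.
        if "not streamable" not in str(err):
            raise
        # Fallback: load the dataset the regular (non-streaming) way.
        ds = load_fn(*args, **kwargs)
    # Either way, iterate over the requested split one example at a time.
    yield from ds[split]
```

With the real library this would be called as iter_examples(load_dataset, 'wikipedia', '20220301.en'). Note that the fallback path still downloads and prepares the whole dataset before yielding anything, so it only gives a uniform iteration interface, not the disk or time savings of true streaming.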