It’s generally a good idea if you want to save disk space or avoid waiting for the full dataset to download.
For example, you can stream the dataset using the `datasets` library by passing `streaming=True` to `load_dataset()`.
However, even in streaming mode, it’s better to have multiple shards so that you can stream in parallel, and ideally to use file formats that work well with streaming, such as WebDataset or Parquet.
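To make this concrete, here is a minimal sketch of streaming from several shards. It assumes the `datasets` package is installed, and it uses tiny local JSON Lines files as stand-ins for real shards so that it runs without network access. The helper names `write_demo_shards` and `stream_first_rows` are made up for illustration. As far as I know, each data file is treated as one shard in streaming mode, which is what lets several workers read in parallel.

```python
import json
import tempfile
from pathlib import Path


def write_demo_shards(directory, num_shards=2, rows_per_shard=3):
    """Write a few tiny JSON Lines files to stand in for dataset shards."""
    paths = []
    for shard in range(num_shards):
        path = Path(directory) / f"part-{shard:05d}.jsonl"
        with open(path, "w", encoding="utf-8") as f:
            for i in range(rows_per_shard):
                f.write(json.dumps({"id": shard * rows_per_shard + i}) + "\n")
        paths.append(str(path))
    return paths


def stream_first_rows(data_files, n=4):
    """Open the files in streaming mode and return the first n examples."""
    # Imported lazily so the helpers above work without the package;
    # this call requires `pip install datasets`.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that reads examples lazily
    # instead of downloading and converting everything up front.
    ds = load_dataset("json", data_files=data_files, split="train", streaming=True)
    return list(ds.take(n))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        files = write_demo_shards(tmp)
        for row in stream_first_rows(files):
            print(row)
```

For a dataset hosted on the Hub you would pass the repository name instead of `"json"` and local files; the `streaming=True` argument stays the same.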
@aaditya did try streaming mode on this dataset, but its custom loading script uses the `jsonlines` package, which we don’t support for streaming (we only extend the builtin `open()` for streaming).
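One possible workaround, assuming the files are plain JSON Lines, is to have the loading script read them with the builtin `open()` and the standard `json` module instead of `jsonlines`. The sketch below is not taken from the dataset’s actual script, and `iter_jsonl` is a hypothetical helper name.

```python
import json


def iter_jsonl(path):
    """Yield one parsed object per non-empty line of a JSON Lines file."""
    # Only the builtin open() is used here, so the file access goes through
    # the one entry point that streaming mode is said to extend.
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Inside a loading script, the example generator could then loop over `iter_jsonl(filepath)` where it previously looped over a `jsonlines` reader.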