How to load a large hf dataset efficiently?

Streaming is indeed a good idea if you want to save disk space or avoid waiting for the full dataset to download.

For example, you can stream the dataset with the datasets library by passing streaming=True to load_dataset().

However, even in streaming mode it's better to have multiple shards so that you can stream in parallel, and ideally to use file formats that stream well, such as WebDataset or Parquet.

@aaditya did try streaming mode on this dataset, but its custom loading script uses the jsonlines package, which we don't support for streaming (we only extend the built-in open() for streaming).