Download only a subset of a split


I was wondering if there is a way to download only part of a dataset.
In my specific case, I need to download only X samples from the English split of oscar (X ≈ 100K samples).
When I invoke the dataset builder, it asks for more than 1 TB of disk space, so I assume it downloads the full dataset up front.

Hello :hugs:

You can load part of a split by slicing:

import datasets

train_10_20_ds = datasets.load_dataset('bookcorpus', split='train[10:20]')

You can refer to more ways of slicing and loading here.

Thank you. Can I also use streaming mode to achieve the same thing?

Hi, let me just add to this: split='train[10:20]' returns a slice of the data, but it still downloads the whole dataset first.

If your dataset is too big, please use streaming mode. You can also slice your dataset in streaming mode, see the documentation here: Stream
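
For illustration, here is a minimal sketch of taking the first N samples in streaming mode. The helper names `take_first` and `load_first_n` are hypothetical (not part of the library), and the oscar config name in the usage comment is an assumption, so check the dataset page for the exact name. Depending on your version of datasets, script-based datasets such as oscar may need extra arguments or may not load at all.

```python
from itertools import islice


def take_first(stream, n):
    """Return the first n items of any iterable without consuming the rest."""
    return list(islice(stream, n))


def load_first_n(path, n, **load_kwargs):
    """Stream a dataset and keep only its first n samples (hypothetical helper)."""
    # Imported lazily so take_first can be used without the library installed.
    from datasets import load_dataset

    # streaming=True returns an iterable dataset that fetches samples on
    # demand instead of downloading the whole split first.
    stream = load_dataset(path, streaming=True, **load_kwargs)
    return take_first(stream, n)


# Example call (needs network access; the config name is an assumption):
# samples = load_first_n("oscar", 100_000,
#                        name="unshuffled_deduplicated_en", split="train")
```

Streamed datasets also have a take(n) method, which returns another streamed dataset rather than a list, so it is handy if you want to keep chaining operations such as map or filter.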

Thank you, that’s the case indeed.