How can I grab the first N rows of a Dataset *as* a Dataset object?

If I have a HF Dataset object my_dataset and try to grab the first, say, 100 rows in the most obvious way possible, my_dataset[:100], I don’t get back another Dataset; I usually get back a plain dict of columns instead. This is extremely inconvenient: when I’m doing a quick test, I sometimes want to just stick [:100] into a line of code to speed things up, but that doesn’t work if the next thing I do is call .map() or something similar.

Is there a convenient way to just quickly grab a subset of a dataset and have it return an actual Dataset?

import datasets

# Load only the first 10% of the `train` split.
train_10pct_ds = datasets.load_dataset('bookcorpus', split='train[:10%]')

Ref: Splits and slicing — datasets 1.11.0 documentation


dataset.select(range(100)) will give you back a Dataset containing the first 100 rows.
