Efficiently slicing a dataset

Hi,

What is the most efficient way to slice a dataset? I have a dataset that only contains the input_ids and attention_mask for evaluating my model, both stored as lists (not tensors). I need to split the dataset into chunks because my GPU memory is not enough to fit everything in one go.

I tried ds.select(range(0, 20)) and I tried ds[0:20]. Both operations take about 2 seconds, and if I increase the size from 20 to 50 elements it takes about 5 seconds. So the time increases roughly linearly with the number of elements.
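For reference, this is roughly what I am timing (a minimal sketch; the small stand-in dataset and the sizes are just examples, my real dataset is much larger):

```python
import time

from datasets import Dataset

# Stand-in for my evaluation dataset (input_ids + attention_mask as lists).
ds = Dataset.from_dict({
    "input_ids": [[101, 2054, 102]] * 10_000,
    "attention_mask": [[1, 1, 1]] * 10_000,
})

for size in (20, 50):
    start = time.perf_counter()
    chunk = ds.select(range(size))  # or: chunk = ds[0:size]
    print(f"slicing {size} rows took {time.perf_counter() - start:.2f}s")
```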

Is there a more time-efficient way to slice a dataset?

Thanks!


You can use ds.select(). Make sure to use a recent version of datasets: slicing has been optimized significantly and is almost instantaneous, even for big datasets. For chunked evaluation you can combine it with a small helper, as in the sketch below.
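A sketch of what that could look like (chunk_size, the "cuda" device, and the commented model call are placeholders for your own setup; it also assumes your sequences are already padded to the same length):

```python
import torch
from datasets import Dataset

def iter_chunks(ds: Dataset, chunk_size: int):
    """Yield contiguous slices of the dataset as new Dataset objects."""
    for start in range(0, len(ds), chunk_size):
        yield ds.select(range(start, min(start + chunk_size, len(ds))))

# Usage: evaluate chunk by chunk so each batch fits in GPU memory.
# for chunk in iter_chunks(ds, chunk_size=20):
#     input_ids = torch.tensor(chunk["input_ids"]).to("cuda")
#     attention_mask = torch.tensor(chunk["attention_mask"]).to("cuda")
#     ...run the model on this batch...
```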


Thanks for the help!

The problem was somewhere else in the end. The print statements I had added to my code to find out which expression takes the longest were themselves slowing it down substantially :man_facepalming:.
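For anyone else hitting this: a lighter-weight way to time expressions without printing on every iteration (just a sketch, the timed() helper and labels are my own):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(label: str):
    """Accumulate elapsed wall-clock time per label instead of printing."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + time.perf_counter() - start

# with timed("select"):
#     chunk = ds.select(range(20))
# ...then print(timings) once at the end.
```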

I learned my lesson :smile: