Efficiently slicing a dataset

Hi,

What is the most efficient way to slice a dataset? I have a dataset that only contains the input_ids and attention_mask for evaluating my model, both stored as lists (not tensors). I need to split the dataset into chunks because my GPU memory is not enough to fit everything in one go.

I tried ds.select(range(0, 20)) and I tried ds[0:20]. Both operations take about 2 seconds, and if I increase the size from 20 to 50 elements it takes about 5 seconds. So the time increases roughly linearly with the number of elements.
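For reference, this is roughly what I am timing (a minimal sketch; the small stand-in dataset and the sizes are just examples, my real dataset is much larger):

```python
import time

from datasets import Dataset

# Stand-in for my evaluation dataset (input_ids + attention_mask as lists).
ds = Dataset.from_dict({
    "input_ids": [[101, 2054, 102]] * 10_000,
    "attention_mask": [[1, 1, 1]] * 10_000,
})

for size in (20, 50):
    start = time.perf_counter()
    chunk = ds.select(range(size))  # or: chunk = ds[0:size]
    print(f"slicing {size} rows took {time.perf_counter() - start:.2f}s")
```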

Is there a more time-efficient way to slice a dataset?

Thanks!


You can use ds.select(). Make sure to use a recent version of datasets: slicing has been optimized significantly and is almost instantaneous, even for big datasets. For chunked evaluation you can combine it with a small helper, as in the sketch below.
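A sketch of what that could look like (chunk_size, the "cuda" device, and the commented model call are placeholders for your own setup; it also assumes your sequences are already padded to the same length):

```python
import torch
from datasets import Dataset

def iter_chunks(ds: Dataset, chunk_size: int):
    """Yield contiguous slices of the dataset as new Dataset objects."""
    for start in range(0, len(ds), chunk_size):
        yield ds.select(range(start, min(start + chunk_size, len(ds))))

# Usage: evaluate chunk by chunk so each batch fits in GPU memory.
# for chunk in iter_chunks(ds, chunk_size=20):
#     input_ids = torch.tensor(chunk["input_ids"]).to("cuda")
#     attention_mask = torch.tensor(chunk["attention_mask"]).to("cuda")
#     ...run the model on this batch...
```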


Thanks for the help!

The problem was somewhere else in the end. The print statements I had added to my code to find out which expression takes the longest were themselves slowing it down substantially :man_facepalming:.
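For anyone else hitting this: a lighter-weight way to time expressions without printing on every iteration (just a sketch, the timed() helper and labels are my own):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(label: str):
    """Accumulate elapsed wall-clock time per label instead of printing."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[label] = timings.get(label, 0.0) + time.perf_counter() - start

# with timed("select"):
#     chunk = ds.select(range(20))
# ...then print(timings) once at the end.
```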

I learned my lesson :smile: