How to sample dataset according to the index

lhoestq · January 10, 2022, 11:46am

Hi! When you use wikidataset[some_indices], it tries to load all the rows at the indices you requested into memory, as a Python dictionary. This can take some time and fill up your memory.

If you just want to select a subset of your dataset and later train a model on it, you can do

subdataset = wikidataset.select(some_indices)

This returns a new Dataset object that only contains the rows at the indices you requested. Moreover, this doesn’t load any data into memory and is pretty fast.
