Hi, I am training BERT on the wikipedia dataset. Only a subset of the inputs is needed, and I already have their indices. However, problems occur when I try to use the sub-dataset. This code is extremely slow: subdataset = wikidataset[selected_indices], where selected_indices is a one-dimensional vector. I thought this might be because the dataset is too large. Is there a way to sample the dataset efficiently?
By the way, I also considered using SubsetRandomSampler, but it seems this sampler does not work with distributed training.
Hi ! When you use wikidataset[some_indices], it tries to load all the rows you requested into memory as a python dictionary. This can take some time and fill up your memory.
If you just want to select a subset of your dataset and later train a model on it, you can do
subdataset = wikidataset.select(some_indices)
This returns a new Dataset object that contains only the indices you requested. Moreover, this doesn't bring any data into memory and is pretty fast.