How to sample a dataset according to the index

Hi, I am training BERT on the wikipedia dataset. I only need a subset of the inputs, and I already have their indices. However, I run into a problem when I try to use the sub-dataset. This code is extremely slow:
subdataset = wikidataset[selected_indices]
where selected_indices is a one-dimensional vector. I thought this might be because the dataset is too large. Is there any way to sample the dataset efficiently?

By the way, I also considered using SubsetRandomSampler, but it seems this sampler does not work in distributed training.

Hi! When you use wikidataset[some_indices], it tries to load all the rows you requested into memory as a Python dictionary. This can take some time and fill up your memory.

If you just want to select a subset of your dataset and later train a model on it, you can do:

subdataset = wikidataset.select(some_indices)

This returns a new Dataset object that only contains the rows at the indices you requested. Moreover, this doesn’t load any data into memory and is pretty fast :)

Thanks! This is exactly what I want.