Hi, I am training BERT on the wikipedia dataset. Only a subset of the inputs is needed, and I already have their indices. However, problems occur when I try to use the sub-dataset. This code is extremely slow: subdataset = wikidataset[selected_indices], where selected_indices is a one-dimensional vector. I thought this might be because the dataset is too large. Is there a way to sample the dataset efficiently?
By the way, I also considered using SubsetRandomSampler, but it seems this sampler does not work with distributed training.
Hi ! When you use wikidataset[some_indices], it tries to load all the rows you requested into memory as a python dictionary. This can take some time and fill up your memory.
If you just want to select a subset of your dataset and later train a model on it, you can do
subdataset = wikidataset.select(some_indices)
This returns a new Dataset object that contains only the indices you requested. Moreover, this doesn't bring any data into memory and is pretty fast.