Hi, I am training BERT on the wikipedia dataset. Only a subset of the inputs is needed, and I already have their indices. However, problems occur when I try to use the sub-dataset: the code that builds it from selected_indices (a one-dimensional tensor of indices) is extremely slow. I suspect this is because the dataset is too large. Is there any way to sample the dataset efficiently?
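For reference, this is roughly the pattern I mean (a minimal sketch with placeholder data; the real dataset is the full Wikipedia corpus):

```python
import torch
from torch.utils.data import TensorDataset

# Stand-in for the real Wikipedia dataset (placeholder data only).
dataset = TensorDataset(torch.arange(10_000).unsqueeze(1))

# One-dimensional tensor with the indices of the examples I want.
selected_indices = torch.tensor([0, 5, 42, 1000])

# Materializing the sub-dataset by fetching every selected example
# one at a time is the pattern that becomes extremely slow at scale.
sub_dataset = [dataset[int(i)] for i in selected_indices]
```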
By the way, I also considered using SubsetRandomSampler, but it seems this sampler does not work with distributed training (as far as I can tell, it has no notion of sharding the indices across processes, so every rank would draw from the same subset independently).
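One workaround I am wondering about (a sketch under my assumptions, not something I have verified in a real DDP run): wrap the indices in torch.utils.data.Subset and hand the resulting subset to a DistributedSampler, since the sampler only needs a dataset with a length. Would something like this behave correctly?

```python
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Stand-in for the real Wikipedia dataset (placeholder data only).
dataset = TensorDataset(torch.arange(10_000).unsqueeze(1))
selected_indices = torch.tensor([0, 5, 42, 1000])

# Subset is lazy: it only stores the index list and defers item
# lookups to the underlying dataset, so constructing it is cheap.
subset = Subset(dataset, selected_indices.tolist())

# DistributedSampler then shards the subset across ranks. The explicit
# num_replicas/rank are only so this sketch runs standalone; under DDP
# they are inferred from the initialized process group.
sampler = DistributedSampler(subset, num_replicas=1, rank=0, shuffle=True)
loader = DataLoader(subset, batch_size=2, sampler=sampler)

for batch in loader:
    pass  # training step would go here
```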