Hi! When you use wikidataset[some_indices], it tries to load all the rows you requested into memory as a Python dictionary. This can take some time and fill up your memory.
If you just want to select a subset of your dataset and later train a model on it, you can do:
subdataset = wikidataset.select(some_indices)
This returns a new Dataset object that only contains the rows at the indices you requested. Moreover, this doesn't load any data into memory and is pretty fast.