How to sample dataset according to the index

Hi! When you use wikidataset[some_indices], it tries to load all the rows you requested into memory as a Python dictionary. This can take some time and fill up your memory.

If you just want to select a subset of your dataset and later train a model on it, you can do

subdataset = wikidataset.select(some_indices)

This returns a new Dataset object that contains only the rows at the indices you requested. It also doesn’t load any data into memory, so it is pretty fast.
