An optimal way to perform partitioning of the dataset

If using the `shuffle` function in the datasets library is acceptable, I think that would be the simplest method. That said, it also seems possible to recreate a subsample for that particular dataset…