For a project, I am trying to split a dataset into training, validation, and test sets. In my data, one individual can have multiple entries that are independent of each other. However, I would like to create the splits so that each ID appears in only one of them, i.e. all instances of one individual (one ID) end up in either the training, validation, or test data. I cannot find a way to do this within the Datasets library. Does anyone know whether this exists, or what approach would work best in this case?
For context, I know that sklearn supports this through GroupShuffleSplit, and I am looking for something similar for Datasets.
Something like this should work:
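import datasets
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def get_groups(dset: datasets.Dataset) -> list[int]:
    "Returns a list of group labels assigned to the dataset's samples"
    # Assumption: each individual's ID is stored in a column named "id"; adapt this to your data
    return list(dset["id"])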
# First split divides dset into train_test_dset and val_dset
gss1 = GroupShuffleSplit(n_splits=1, train_size=.7, random_state=42)
train_test_idx, val_idx = next(gss1.split(np.ones(len(dset)), groups=get_groups(dset)))
train_test_dset, val_dset = dset.select(train_test_idx), dset.select(val_idx)
# Second split divides train_test_dset into train_dset and test_dset
gss2 = GroupShuffleSplit(n_splits=1, train_size=.7, random_state=42)
train_idx, test_idx = next(gss2.split(np.ones(len(train_test_dset)), groups=get_groups(train_test_dset)))
train_dset, test_dset = train_test_dset.select(train_idx), train_test_dset.select(test_idx)
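To double-check that no ID ends up in more than one split, a small sanity check along these lines may help. This is only a sketch: `groups_are_disjoint` is a hypothetical helper, not part of Datasets or sklearn, and it only assumes that each split's group labels can be passed in as an iterable of hashable values.

```python
from itertools import combinations
from typing import Hashable, Iterable


def groups_are_disjoint(*splits: Iterable[Hashable]) -> bool:
    """Return True if no group label appears in more than one split."""
    # Collapse each split's labels to the set of distinct groups it contains
    unique_per_split = [set(split) for split in splits]
    # Every pair of splits must share no group label
    return all(a.isdisjoint(b) for a, b in combinations(unique_per_split, 2))
```

With the code above, it could be called as groups_are_disjoint(get_groups(train_dset), get_groups(val_dset), get_groups(test_dset)), which should return True if the grouped splitting worked as intended.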