[Q] How to assure no overlap in IDs between train, test, and validation split?

RikRaes · September 8, 2023, 11:58am

For a project, I am trying to split a dataset into training, validation, and test sets. In my data, one individual can have multiple entries that are independent of each other. However, I would like to create the splits so that each ID belongs to only one of them, i.e. all instances of one individual (one ID) appear in only the training, validation, or test data. I cannot find a solution for this within the Datasets library. Does anyone know if one exists, or what approach would be best in this case?

For context, I know that sklearn supports this through its GroupShuffleSplit, and I am looking for something similar for the Datasets library.

mariosasko · September 13, 2023, 5:45pm

Something like this should work:

from sklearn.model_selection import GroupShuffleSplit
import datasets
import numpy as np

def get_groups(dset: datasets.Dataset) -> list[int]:
    "Returns a list of group labels assigned to the dataset's samples"
    ...

# First split divides dset into train_test_dset and val_dset
gss1 = GroupShuffleSplit(n_splits=1, train_size=.7, random_state=42)
train_test_idx, val_idx = next(gss1.split(np.ones(len(dset)), groups=get_groups(dset)))
train_test_dset, val_dset = dset.select(train_test_idx), dset.select(val_idx)

# Second split divides train_test_dset into train_dset and test_dset
gss2 = GroupShuffleSplit(n_splits=1, train_size=.7, random_state=42)
train_idx, test_idx = next(gss2.split(np.ones(len(train_test_dset)), groups=get_groups(train_test_dset)))
train_dset, test_dset = train_test_dset.select(train_idx), train_test_dset.select(test_idx)
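
If you would rather not depend on sklearn, the same idea can be sketched in plain Python: shuffle the unique group labels, cut them into three chunks, and collect the row indices for each chunk. This is only a sketch under assumptions: `group_split_indices`, the 70/15/15 fractions, and the toy `person_*` IDs are made up for illustration, and the fractions apply to the number of groups rather than the number of rows.

```python
import random
from collections import defaultdict


def group_split_indices(groups, fractions=(0.7, 0.15, 0.15), seed=42):
    """Assign every group to exactly one split and return row indices per split.

    `groups` holds one group label (e.g. an individual's ID) per row.
    Hypothetical helper: fractions are applied to the number of groups.
    """
    # Collect the row indices that belong to each group label.
    rows_by_group = defaultdict(list)
    for row_idx, label in enumerate(groups):
        rows_by_group[label].append(row_idx)

    # Sort first so the shuffle is reproducible (labels must be sortable).
    labels = sorted(rows_by_group)
    random.Random(seed).shuffle(labels)

    # Cut the shuffled labels into three contiguous chunks.
    n_train = int(len(labels) * fractions[0])
    n_val = int(len(labels) * fractions[1])
    chunks = {
        "train": labels[:n_train],
        "validation": labels[n_train:n_train + n_val],
        "test": labels[n_train + n_val:],
    }
    # Expand each chunk of labels back into sorted row indices.
    return {
        name: sorted(i for label in chunk for i in rows_by_group[label])
        for name, chunk in chunks.items()
    }


# Toy data: 10 individuals with 3 rows each (made-up IDs).
ids = [f"person_{i % 10}" for i in range(30)]
splits = group_split_indices(ids)

# Sanity check: no ID appears in more than one split.
id_sets = {name: {ids[i] for i in idx} for name, idx in splits.items()}
assert id_sets["train"].isdisjoint(id_sets["validation"])
assert id_sets["train"].isdisjoint(id_sets["test"])
assert id_sets["validation"].isdisjoint(id_sets["test"])
```

The returned index lists can then be passed to `dset.select(...)` as in the snippet above, with `groups` taken from whichever column holds the individual's ID in your dataset. The same disjointness check is also a cheap way to verify the output of the GroupShuffleSplit approach.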
