K fold cross validation

muellerzr · April 19, 2022, 2:29pm

Very old thread I know, but here’s an alternative to @lewtun’s solution that I like:

import numpy as np
from sklearn.model_selection import StratifiedKFold

from datasets import load_dataset

# First make the kfold object
folds = StratifiedKFold(n_splits=5)

# Then get the dataset
datasets = load_dataset("glue", "mrpc")

# Now make our splits based off of the labels. 
# `np.zeros()` works as a placeholder for X here: the split only needs the number of rows, and the stratification comes from the labels
splits = folds.split(np.zeros(datasets["train"].num_rows), datasets["train"]["label"])

# Finally, do what you want with it
# In this case I'm overriding the train/val/test
for train_idxs, val_idxs in splits:
    datasets = load_dataset("glue", "mrpc")
    datasets["test"] = datasets["validation"]
    datasets["validation"] = datasets["train"].select(val_idxs)
    datasets["train"] = datasets["train"].select(train_idxs)
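
To sanity-check the `np.zeros()` placeholder trick, here is a minimal sketch on made-up toy labels (not MRPC; the 80/20 class balance is just an assumption for illustration). It only checks that each validation fold keeps the overall label ratio and never overlaps with its train indices:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (made up for illustration): 80 negatives, 20 positives
labels = np.array([0] * 80 + [1] * 20)

folds = StratifiedKFold(n_splits=5)

# X is only a placeholder of the right length; stratification uses `labels`
fold_stats = []
for train_idxs, val_idxs in folds.split(np.zeros(len(labels)), labels):
    # No index should appear in both the train and validation part of a fold
    assert np.intersect1d(train_idxs, val_idxs).size == 0
    # Record (validation size, number of positives in validation)
    fold_stats.append((len(val_idxs), int(labels[val_idxs].sum())))

print(fold_stats)
```

With these toy labels every validation fold should come out with 20 rows, 4 of them positive, which matches the 20% positive rate of the full set.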

To avoid reloading the dataset on every fold, you can instead build a new DatasetDict inside the loop:

from datasets import DatasetDict

# Inside the `for train_idxs, val_idxs in splits:` loop
fold_dataset = DatasetDict({
    "train": datasets["train"].select(train_idxs),
    "validation": datasets["train"].select(val_idxs),
    "test": datasets["validation"],
})
