K fold cross validation

Hi,
I use Trainer() to fine-tune the bert-base-cased model on an NER task. I split my dataset with sklearn.model_selection.train_test_split.

Now I want to use k-fold cross validation to split the dataset and fine-tune the model.
Has anyone tried this? Please share if you have any ideas.


One suggestion would be to use the split functionality of datasets to create your folds, as described here: Splits and slicing — datasets 1.6.0 documentation

Then you could use a loop to fine-tune on each fold with the Trainer and aggregate the predictions per fold.
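For instance, a minimal sketch of building the folds with the slicing syntax from the linked docs (GLUE MRPC and 10 folds are used purely for illustration) could look like this:

from datasets import load_dataset

# Each entry in the list is one fold: the validation folds are contiguous
# 10% slices of the train split, and the matching train folds are the rest.
val_folds = load_dataset("glue", "mrpc",
                         split=[f"train[{k}%:{k+10}%]" for k in range(0, 100, 10)])
train_folds = load_dataset("glue", "mrpc",
                           split=[f"train[:{k}%]+train[{k+10}%:]" for k in range(0, 100, 10)])

You can then zip the two lists together and run one training per (train, validation) pair.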


Very old thread I know, but here’s an alternative to @lewtun’s solution that I like:

import numpy as np
from sklearn.model_selection import StratifiedKFold

from datasets import load_dataset

# First make the kfold object
folds = StratifiedKFold(n_splits=5)

# Then get the dataset
datasets = load_dataset("glue", "mrpc")

# Now make our splits based on the labels.
# We can pass `np.zeros()` for the features since StratifiedKFold only uses
# them for their length; the stratification itself is driven by the labels.
splits = folds.split(np.zeros(datasets["train"].num_rows), datasets["train"]["label"])

# Finally, do what you want with it
# In this case I'm overriding the train/val/test
for train_idxs, val_idxs in splits:
    datasets = load_dataset("glue", "mrpc")
    datasets["test"] = datasets["validation"]
    datasets["validation"] = datasets["train"].select(val_idxs)
    datasets["train"] = datasets["train"].select(train_idxs)

For a method that doesn't require reloading the dataset on every fold, you can also do:

from datasets import DatasetDict

# Build a fresh DatasetDict for the current fold instead of reloading the dataset
fold_dataset = DatasetDict({
    "train": datasets["train"].select(train_idxs),
    "validation": datasets["train"].select(val_idxs),
    "test": datasets["validation"],
})
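Putting the pieces together, here is a minimal sketch of running one Trainer per fold and averaging the evaluation metrics. The tokenization function, model choice, and TrainingArguments values below are illustrative assumptions rather than a fixed recipe, and you would swap in your own preprocessing and metrics for an NER task:

import numpy as np
from datasets import DatasetDict, load_dataset
from sklearn.model_selection import StratifiedKFold
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

datasets = load_dataset("glue", "mrpc")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize(batch):
    # MRPC is a sentence-pair task; replace this with your own preprocessing for NER
    return tokenizer(batch["sentence1"], batch["sentence2"], truncation=True)

encoded = datasets.map(tokenize, batched=True)

folds = StratifiedKFold(n_splits=5)
splits = folds.split(np.zeros(encoded["train"].num_rows), encoded["train"]["label"])

fold_metrics = []
for fold, (train_idxs, val_idxs) in enumerate(splits):
    fold_dataset = DatasetDict({
        "train": encoded["train"].select(train_idxs),
        "validation": encoded["train"].select(val_idxs),
        "test": encoded["validation"],
    })
    # Use a fresh model per fold so the folds don't leak into each other
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
    args = TrainingArguments(output_dir=f"fold-{fold}", num_train_epochs=1,
                             per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                      train_dataset=fold_dataset["train"],
                      eval_dataset=fold_dataset["validation"])
    trainer.train()
    fold_metrics.append(trainer.evaluate())

# Average each evaluation metric over the folds
print({k: float(np.mean([m[k] for m in fold_metrics])) for k in fold_metrics[0]})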