K fold cross validation

I use Trainer() to fine-tune the bert-base-cased model on an NER task. I split my dataset with sklearn.model_selection.train_test_split.

Now I want to use k-fold cross-validation to split the dataset and fine-tune the model.
Has anyone tried this? Please tell me if you have any ideas.


One suggestion would be to use the split functionality of datasets to create your folds, as described in the "Splits and slicing" page of the datasets 1.6.0 documentation.

Then you could use a loop to fine-tune on each fold with the Trainer and aggregate the predictions per fold.
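
To make the idea concrete, here is a rough sketch of how the percent-slicing syntax from that documentation page could be turned into one (train, validation) pair of split strings per fold. The helper name `fold_split_specs` is made up for illustration, and the sketch assumes the fold count divides 100 evenly so the percent boundaries stay whole numbers.

```python
def fold_split_specs(n_folds, split="train"):
    """Build (train_spec, val_spec) split strings, one pair per fold.

    Hypothetical helper: it only formats strings in the percent-slicing
    syntax, it does not load any data itself.
    """
    if n_folds < 2 or 100 % n_folds != 0:
        raise ValueError("this sketch assumes n_folds >= 2 and that it divides 100")
    step = 100 // n_folds
    specs = []
    for start in range(0, 100, step):
        end = start + step
        # Validation is one contiguous slice; training is everything around it.
        val_spec = f"{split}[{start}%:{end}%]"
        train_spec = f"{split}[:{start}%]+{split}[{end}%:]"
        specs.append((train_spec, val_spec))
    return specs


specs = fold_split_specs(5)
for train_spec, val_spec in specs:
    # Each string can be passed as the `split` argument, for example:
    #   from datasets import load_dataset
    #   train_ds = load_dataset("glue", "mrpc", split=train_spec)
    #   val_ds = load_dataset("glue", "mrpc", split=val_spec)
    print(train_spec, "|", val_spec)
```

Note that these slices are contiguous and unshuffled, so if the dataset is ordered by label or by source it is probably worth shuffling it before relying on folds built this way.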

Very old thread I know, but here’s an alternative to @lewtun’s solution that I like:

import numpy as np
from sklearn.model_selection import StratifiedKFold

from datasets import load_dataset

# First make the kfold object
folds = StratifiedKFold(n_splits=5)

# Then get the dataset
datasets = load_dataset("glue", "mrpc")

# Now make our splits based on the labels.
# `split()` only uses its first argument for the number of rows, so an `np.zeros()`
# placeholder is enough; the stratification itself is driven by the labels
splits = folds.split(np.zeros(datasets["train"].num_rows), datasets["train"]["label"])

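# Finally, do what you want with it
# In this case I'm overriding the train/validation/test splits for each fold
for train_idxs, val_idxs in splits:
    # Reload so every fold starts from the original, unmodified splits
    datasets = load_dataset("glue", "mrpc")
    datasets["test"] = datasets["validation"]
    datasets["validation"] = datasets["train"].select(val_idxs)
    datasets["train"] = datasets["train"].select(train_idxs)
    # Fine-tune and evaluate on this fold here, before the next iteration overwrites `datasets`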
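
Once each fold has been trained and evaluated inside that loop, the per-fold metrics still need to be combined. Below is a minimal sketch of one way to do that, assuming each fold yields a dict of numeric metrics with the same keys (for instance, what `trainer.evaluate()` returns). The function name `aggregate_fold_metrics` is hypothetical and the numbers are made-up placeholders, not real results.

```python
from statistics import mean, stdev


def aggregate_fold_metrics(fold_metrics):
    """Compute the mean and sample standard deviation of each metric across folds.

    `fold_metrics` is a list with one dict per fold; all dicts are assumed
    to share the same keys and hold numeric values.
    """
    summary = {}
    for name in fold_metrics[0]:
        values = [metrics[name] for metrics in fold_metrics]
        summary[name] = {
            "mean": mean(values),
            # stdev needs at least two values, so fall back to 0.0 for a single fold
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


# Placeholder values standing in for per-fold evaluation output (not real results).
fold_metrics = [
    {"eval_loss": 0.5, "eval_accuracy": 0.8},
    {"eval_loss": 0.7, "eval_accuracy": 0.9},
]
summary = aggregate_fold_metrics(fold_metrics)
print(summary)
```

In the loop above, this would mean appending each fold's metrics dict to a list and calling the aggregation once after the last fold.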

To avoid reloading the dataset on every fold, you can also do:

from datasets import DatasetDict
fold_dataset = DatasetDict({