K-fold cross validation

Hi,
I use Trainer() to fine-tune the bert-base-cased model on an NER task. I split my dataset with sklearn.model_selection.train_test_split.

Now, I want to use k-fold cross validation to split the dataset and fine-tune the model.
Has anyone tried this? Please tell me if you have any ideas.

One suggestion would be to use the split functionality of datasets to create your folds, as described here: Splits and slicing — datasets 1.6.0 documentation
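For example, the percent-slicing syntax from that page can build the folds directly (MRPC here is just a stand-in for your own dataset):

from datasets import load_dataset

# 5 validation folds: train[0%:20%], train[20%:40%], ..., train[80%:100%]
val_folds = load_dataset("glue", "mrpc", split=[f"train[{k}%:{k+20}%]" for k in range(0, 100, 20)])
# matching train folds: everything outside each validation slice
train_folds = load_dataset("glue", "mrpc", split=[f"train[:{k}%]+train[{k+20}%:]" for k in range(0, 100, 20)])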

Then you could use a loop to fine-tune on each fold with the Trainer and aggregate the predictions per fold, along these lines:
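Here's a rough sketch of that loop (assuming the folds above are already tokenized; bert-base-cased and the bare TrainingArguments are just placeholders for your own setup):

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

fold_metrics = []
for fold, (train_ds, val_ds) in enumerate(zip(train_folds, val_folds)):
    # Re-initialize the model every fold so weights don't leak across folds
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir=f"fold-{fold}"),
        train_dataset=train_ds,
        eval_dataset=val_ds,
    )
    trainer.train()
    fold_metrics.append(trainer.evaluate())

# Aggregate, e.g. average eval loss across folds
mean_loss = sum(m["eval_loss"] for m in fold_metrics) / len(fold_metrics)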

Very old thread I know, but here’s an alternative to @lewtun’s solution that I like:

import numpy as np
from sklearn.model_selection import StratifiedKFold

from datasets import load_dataset

# First make the kfold object
folds = StratifiedKFold(n_splits=5)

# Then get the dataset
datasets = load_dataset("glue", "mrpc")

# Now make our splits based off of the labels.
# We can pass `np.zeros()` for X since `split()` only looks at its indices; the labels are what matter
splits = folds.split(np.zeros(datasets["train"].num_rows), datasets["train"]["label"])

# Finally, do what you want with it
# In this case I'm overriding the train/val/test
for train_idxs, val_idxs in splits:
    datasets = load_dataset("glue", "mrpc")
    datasets["test"] = datasets["validation"]
    datasets["validation"] = datasets["train"].select(val_idxs)
    datasets["train"] = datasets["train"].select(train_idxs)

To avoid reloading the dataset on every fold, you can also do:

from datasets import DatasetDict

fold_dataset = DatasetDict({
    "train": datasets["train"].select(train_idxs),
    "validation": datasets["train"].select(val_idxs),
    "test": datasets["validation"],
})

import numpy as np
from sklearn.model_selection import StratifiedKFold
from datasets import load_dataset, DatasetDict

# First make the kfold object
folds = StratifiedKFold(n_splits=10)

# Now make our splits based off of the labels.
# We can pass `np.zeros()` for X since `split()` only looks at its indices; the labels are what matter
# (`datasets` here is my own DatasetDict, an NER set with a `tags` column)
splits = folds.split(np.zeros(datasets["train"].num_rows), datasets["train"]["tags"])

# Finally, do what you want with it
# In this case I'm overriding the train/val/test
for train_idxs, val_idxs in splits:
    fold_dataset = DatasetDict({
        "train": datasets["train"].select(train_idxs),
        "validation": datasets["train"].select(val_idxs),
        "test": datasets["validation"],
    })

How can I do the same for multi-label data? I get this error…

ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead - the MultiLabelBinarizer transformer can convert to this format.

But if I add MultiLabelBinarizer I get this

splits = folds.split(np.zeros(datasets["train"].num_rows), MultiLabelBinarizer().fit_transform(datasets["train"]["tags"]))

ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead.
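One workaround might be the third-party iterative-stratification package: sklearn's StratifiedKFold only accepts binary or multiclass targets (that's exactly what the second error says), while its MultilabelStratifiedKFold accepts the indicator matrix that MultiLabelBinarizer produces. A rough sketch, assuming your `tags` column is what you want to stratify on:

import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold  # pip install iterative-stratification

folds = MultilabelStratifiedKFold(n_splits=10)
# Binarize the label lists into the indicator matrix the splitter expects
y = MultiLabelBinarizer().fit_transform(datasets["train"]["tags"])
splits = folds.split(np.zeros(datasets["train"].num_rows), y)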