K-fold cross-validation

Hi,
I use Trainer() to fine-tune the bert-base-cased model on an NER task. I split my dataset with sklearn.model_selection.train_test_split.

Now, I want to use k-fold cross-validation to split the dataset and fine-tune the model.
Has anyone tried the same approach? Please share any ideas you have.


One suggestion would be to use the split functionality of datasets to create your folds, as described here: Splits and slicing — datasets 1.6.0 documentation

Then you could use a loop to fine-tune on each fold with the Trainer and aggregate the predictions per fold.
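
As a sketch of that slicing approach (the split-string format follows the datasets "Splits and slicing" docs; the percentages assume 5 equal folds), you can build per-fold split expressions and pass them to `load_dataset`:

```python
# Build per-fold split expressions using the datasets slicing syntax:
# fold k holds out one contiguous 20% chunk for validation and
# trains on the remaining 80%.
n_folds = 5
step = 100 // n_folds
val_splits = [f"train[{k * step}%:{(k + 1) * step}%]" for k in range(n_folds)]
train_splits = [f"train[:{k * step}%]+train[{(k + 1) * step}%:]"
                for k in range(n_folds)]

# Passing a list of split strings returns one dataset per entry, e.g.:
# val_folds = load_dataset("glue", "mrpc", split=val_splits)
# train_folds = load_dataset("glue", "mrpc", split=train_splits)
```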


Very old thread I know, but here’s an alternative to @lewtun’s solution that I like:

import numpy as np
from sklearn.model_selection import StratifiedKFold

from datasets import load_dataset

# First make the kfold object
folds = StratifiedKFold(n_splits=5)

# Then get the dataset
datasets = load_dataset("glue", "mrpc")

# Now make our splits based off of the labels.
# `StratifiedKFold` only looks at the labels (y), so we can pass
# `np.zeros()` as a placeholder for the features (X)
splits = folds.split(np.zeros(datasets["train"].num_rows), datasets["train"]["label"])

# Finally, do what you want with it
# In this case I'm overriding the train/val/test
for train_idxs, val_idxs in splits:
    datasets = load_dataset("glue", "mrpc")
    datasets["test"] = datasets["validation"]
    datasets["validation"] = datasets["train"].select(val_idxs)
    datasets["train"] = datasets["train"].select(train_idxs)

For a method without having to reload the dataset, you can also do:

from datasets import DatasetDict

fold_dataset = DatasetDict({
    "train": datasets["train"].select(train_idxs),
    "validation": datasets["train"].select(val_idxs),
    "test": datasets["validation"],
})

import numpy as np
from sklearn.model_selection import StratifiedKFold
from datasets import load_dataset, DatasetDict

# First make the kfold object
folds = StratifiedKFold(n_splits=10)

# Now make our splits based off of the labels.
# `StratifiedKFold` only looks at the labels (y), so we can pass
# `np.zeros()` as a placeholder for the features (X)
splits = folds.split(np.zeros(datasets["train"].num_rows), datasets["train"]["tags"])
# Finally, do what you want with it
# In this case I'm overriding the train/val/test
for train_idxs, val_idxs in splits:
    fold_dataset = DatasetDict({
        "train": datasets["train"].select(train_idxs),
        "validation": datasets["train"].select(val_idxs),
        "test": datasets["validation"],
    })

How can I do the same for multi-label data? I get this error…

ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead - the MultiLabelBinarizer transformer can convert to this format.

But if I add MultiLabelBinarizer I get this:

splits = folds.split(np.zeros(datasets["train"].num_rows), MultiLabelBinarizer().fit_transform(datasets["train"]["tags"]))

ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead.
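
The error occurs because `StratifiedKFold` only accepts binary or multiclass targets, so it rejects a multi-label indicator matrix. A plain `KFold` sidesteps this, since it ignores labels entirely (a sketch, not label-stratified; the row count is a stand-in for `datasets["train"].num_rows`):

```python
import numpy as np
from sklearn.model_selection import KFold

# KFold splits on indices alone, so multi-label targets are no obstacle
n_rows = 8  # stands in for datasets["train"].num_rows
folds = KFold(n_splits=4, shuffle=True, random_state=0)
splits = list(folds.split(np.zeros(n_rows)))
```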


I like your approach, as you can use sklearn’s CV methods (in particular, the grouping ones that aren’t yet in Hugging Face datasets as of Jan '23).

You do have an issue though: you’ve overloaded the name “datasets”. Your code will run but it’s a bit confusing, as @scostiniano found out!

@scostiniano : Take @muellerzr 's first version of the code up to # Finally, do what you want with it, and replace everything from there down with the second version of the code; then it should work.

Ideally, you should have something like glue_datasets = load_dataset("glue", "mrpc") to avoid the confusion (and to retain access to the datasets module, which is no longer accessible with your current code!)


Is there an example script showing how to call Trainer() five times (once per fold) with different train/test splits?

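Not in this thread, but the shape of such a loop is simple. Here's a skeleton (hedged: `run_fold` is a placeholder for your own code that builds the fold's `DatasetDict` via `.select()`, instantiates a fresh model and `Trainer`, calls `trainer.train()`, and returns `trainer.evaluate()`'s metrics); the key points are re-initializing the model each fold and aggregating the metrics at the end:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def run_fold(train_idxs, val_idxs):
    # Placeholder: build fold_dataset via .select(), create a FRESH model
    # and Trainer here (reusing one model would leak training across folds),
    # then train and return trainer.evaluate()'s metrics dict.
    return {"eval_accuracy": 0.8}

labels = [0, 1] * 10  # stands in for datasets["train"]["label"]
folds = StratifiedKFold(n_splits=5)
fold_metrics = [run_fold(tr, va)
                for tr, va in folds.split(np.zeros(len(labels)), labels)]

# Aggregate across folds, e.g. report the mean validation accuracy
mean_acc = float(np.mean([m["eval_accuracy"] for m in fold_metrics]))
```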