K-fold cross-validation

Hi,
I use Trainer() to fine-tune the bert-base-cased model on an NER task. I split my dataset with sklearn.model_selection.train_test_split.

Now, I want to use k-fold cross-validation to split the dataset and fine-tune the model.
Has anyone tried the same approach? Please share any ideas you have.


One suggestion would be to use the split functionality of datasets to create your folds, as described here: Splits and slicing — datasets 1.6.0 documentation

Then you could use a loop to fine-tune on each fold with the Trainer and aggregate the predictions per fold.
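
As a sketch of that slicing approach (the split-string format follows the datasets "Splits and slicing" docs; the percentages assume 5 equal folds), you can build per-fold split expressions and pass them to `load_dataset`:

```python
# Build per-fold split expressions using the datasets slicing syntax:
# fold k holds out one contiguous 20% chunk for validation and
# trains on the remaining 80%.
n_folds = 5
step = 100 // n_folds
val_splits = [f"train[{k * step}%:{(k + 1) * step}%]" for k in range(n_folds)]
train_splits = [f"train[:{k * step}%]+train[{(k + 1) * step}%:]"
                for k in range(n_folds)]

# Passing a list of split strings returns one dataset per entry, e.g.:
# val_folds = load_dataset("glue", "mrpc", split=val_splits)
# train_folds = load_dataset("glue", "mrpc", split=train_splits)
```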


Very old thread I know, but here’s an alternative to @lewtun’s solution that I like:

import numpy as np
from sklearn.model_selection import StratifiedKFold

from datasets import load_dataset

# First make the kfold object
folds = StratifiedKFold(n_splits=5)

# Then get the dataset
datasets = load_dataset("glue", "mrpc")

# Now make our splits based off of the labels.
# `StratifiedKFold` only looks at the labels (y), so we can pass
# `np.zeros()` as a placeholder for the features (X)
splits = folds.split(np.zeros(datasets["train"].num_rows), datasets["train"]["label"])

# Finally, do what you want with it
# In this case I'm overriding the train/val/test
for train_idxs, val_idxs in splits:
    datasets = load_dataset("glue", "mrpc")
    datasets["test"] = datasets["validation"]
    datasets["validation"] = datasets["train"].select(val_idxs)
    datasets["train"] = datasets["train"].select(train_idxs)

For a method without having to reload the dataset, you can also do:

from datasets import DatasetDict

fold_dataset = DatasetDict({
    "train": datasets["train"].select(train_idxs),
    "validation": datasets["train"].select(val_idxs),
    "test": datasets["validation"],
})

import numpy as np
from sklearn.model_selection import StratifiedKFold
from datasets import load_dataset, DatasetDict

# First make the kfold object
folds = StratifiedKFold(n_splits=10)

# Now make our splits based off of the labels.
# `StratifiedKFold` only looks at the labels (y), so we can pass
# `np.zeros()` as a placeholder for the features (X)
splits = folds.split(np.zeros(datasets["train"].num_rows), datasets["train"]["tags"])
# Finally, do what you want with it
# In this case I'm overriding the train/val/test
for train_idxs, val_idxs in splits:
    fold_dataset = DatasetDict({
        "train": datasets["train"].select(train_idxs),
        "validation": datasets["train"].select(val_idxs),
        "test": datasets["validation"],
    })

How can I do the same for multi-label data? I get this error…

ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead - the MultiLabelBinarizer transformer can convert to this format.

But if I add MultiLabelBinarizer I get this:

splits = folds.split(np.zeros(datasets["train"].num_rows), MultiLabelBinarizer().fit_transform(datasets["train"]["tags"]))

ValueError: Supported target types are: ('binary', 'multiclass'). Got 'multilabel-indicator' instead.
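
The error occurs because `StratifiedKFold` only accepts binary or multiclass targets, so it rejects a multi-label indicator matrix. A plain `KFold` sidesteps this, since it ignores labels entirely (a sketch, not label-stratified; the row count is a stand-in for `datasets["train"].num_rows`):

```python
import numpy as np
from sklearn.model_selection import KFold

# KFold splits on indices alone, so multi-label targets are no obstacle
n_rows = 8  # stands in for datasets["train"].num_rows
folds = KFold(n_splits=4, shuffle=True, random_state=0)
splits = list(folds.split(np.zeros(n_rows)))
```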


I like your approach, as you can use sklearn’s CV methods (in particular, the grouping ones that aren’t yet in Hugging Face datasets as of Jan '23).

You do have an issue though: you’ve overloaded the name “datasets”. Your code will run but it’s a bit confusing, as @scostiniano found out!

@scostiniano : Take @muellerzr 's first version of the code up to # Finally, do what you want with it, and replace everything from there down with the second version of the code; then it should work.

Ideally, you should have something like glue_datasets = load_dataset("glue", "mrpc") to avoid the confusion (and to retain access to the datasets module, which is no longer accessible with your current code!)


Is there an example script showing how to call Trainer() five times (once per fold) with different train/test splits?

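Not in this thread, but the shape of such a loop is simple. Here's a skeleton (hedged: `run_fold` is a placeholder for your own code that builds the fold's `DatasetDict` via `.select()`, instantiates a fresh model and `Trainer`, calls `trainer.train()`, and returns `trainer.evaluate()`'s metrics); the key points are re-initializing the model each fold and aggregating the metrics at the end:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def run_fold(train_idxs, val_idxs):
    # Placeholder: build fold_dataset via .select(), create a FRESH model
    # and Trainer here (reusing one model would leak training across folds),
    # then train and return trainer.evaluate()'s metrics dict.
    return {"eval_accuracy": 0.8}

labels = [0, 1] * 10  # stands in for datasets["train"]["label"]
folds = StratifiedKFold(n_splits=5)
fold_metrics = [run_fold(tr, va)
                for tr, va in folds.split(np.zeros(len(labels)), labels)]

# Aggregate across folds, e.g. report the mean validation accuracy
mean_acc = float(np.mean([m["eval_accuracy"] for m in fold_metrics]))
```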