How to split Hugging Face dataset to train and test?

lhoestq · August 19, 2022, 10:26am

Yup, please check the stratify_by_column argument of train_test_split in the docs. It keeps the label proportions roughly the same in both splits:

>>> from datasets import load_dataset
>>> ds = load_dataset("imdb", split="train")
>>> ds
Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})
>>> ds = ds.train_test_split(test_size=0.2, stratify_by_column="label")
>>> ds
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})
