How to split a Hugging Face dataset into train and test?

I have a JSON file with data that I want to load and split into train and test sets (70% of the data for train).

I’m loading the records like this:

import os

from datasets import load_dataset

full_path = "/home/ad/ds/fiction"

data_files = {
    "DATA": os.path.join(full_path, "dev.json")
}

ds = load_dataset("json", data_files=data_files)
ds

DatasetDict({
    DATA: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 750
    })
})
  • How can I split this ds into train and test?
  • Can I change the DATA label to TRAIN and TEST with the relevant samples?

Hello and welcome @laro1!

You can use the train_test_split() method and set the test_size parameter to control the size of the split. Note that it’s defined on Dataset rather than DatasetDict, so call it on your loaded DATA split. For example:

ds["DATA"].train_test_split(test_size=0.3)

DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 525
    })
    test: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 225
    })
})

Check out the docs here and let me know if that helps! :hugs:

Is there anything like the “stratify” param in scikit-learn?

(Or, more generally, is there a way to ensure class balancing in the train and test splits?)

Yup, please check the stratify_by_column argument in the docs:

>>> from datasets import load_dataset
>>> ds = load_dataset("imdb", split="train")
>>> ds
Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})
>>> ds = ds.train_test_split(test_size=0.2, stratify_by_column="label")
>>> ds
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})