How to split Hugging Face dataset to train and test?

laro1 · July 26, 2022, 12:24pm

I have a JSON file with data that I want to load and split into train and test sets (70% of the data for training).

I’m loading the records in this way:

import os

from datasets import load_dataset

full_path = "/home/ad/ds/fiction"

# A single file, registered under a custom split name
data_files = {
    "DATA": os.path.join(full_path, "dev.json")
}

ds = load_dataset("json", data_files=data_files)
ds

DatasetDict({
    DATA: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 750
    })
})

How can I split this ds into train and test sets?
Can I replace the DATA split with TRAIN and TEST splits that hold the relevant samples?

stevhliu · July 26, 2022, 4:13pm

Hello and welcome @laro1!

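You can use the train_test_split() method and specify the test_size parameter to determine the size of the test split. Note that train_test_split() is defined on a Dataset, not on a DatasetDict, so select your DATA split first. For example:

ds["DATA"].train_test_split(test_size=0.3)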

DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 525
    })
    test: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 225
    })
})

Check out the docs here and let me know if that helps!

afriedman412 · August 13, 2022, 11:58pm

Is there anything like the “stratify” param in scikit-learn?

(Or, more generally, a way to ensure class balance in the train and test splits?)

lhoestq · August 19, 2022, 10:26am

Yup, please check the stratify_by_column argument in the docs:

>>> from datasets import load_dataset
>>> ds = load_dataset("imdb", split="train")
>>> ds
Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})
>>> ds = ds.train_test_split(test_size=0.2, stratify_by_column="label")
>>> ds
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})

YanaS · December 22, 2022, 7:39am

When I load my custom dataset from a dictionary and try a stratified split, I get an error:

ValueError: Stratifying by column is only supported for ClassLabel column, and column label is Sequence.

import pickle

import datasets

with open("/content/drive/MyDrive/all.bio.pickle", "rb") as f:
    bio_dict = pickle.load(f)

ds = datasets.Dataset.from_dict(bio_dict)

Dataset({
    features: ['id', 'text', 'ner_tags', 'input_ids', 'attention_mask', 'label'],
    num_rows: 8805
})

# First split: 50% train, 50% held out
train_testvalid = ds.train_test_split(test_size=0.5, shuffle=True, stratify_by_column="label")

# Second split: divide the held-out half into validation and test
test_valid = train_testvalid["test"].train_test_split(test_size=0.5, shuffle=True, stratify_by_column="label")

ttv_ds = datasets.DatasetDict({
    "train": train_testvalid["train"],
    "validation": test_valid["train"],
    "test": test_valid["test"]})

mkdeeperinsights · January 24, 2023, 4:56pm

Yes, this is an annoying error; it looks like they are using sklearn in the background.

One way to overcome this (as long as your labels have at least 2 members per group) is to cast the label column to a ClassLabel first:

# column we want to stratify with respect to
stratify_column_name = "label"

# create class label column and stratify
dataset.class_encode_column(
    stratify_column_name
).train_test_split(
    test_size=0.2, 
    stratify_by_column=stratify_column_name
)
