laro1
July 26, 2022, 12:24pm
1
I have a JSON file with data that I want to load and split into train and test sets (70% of the data for training).
I'm loading the records this way:
import os
from datasets import load_dataset

full_path = "/home/ad/ds/fiction"
data_files = {
    "DATA": os.path.join(full_path, "dev.json")
}
ds = load_dataset("json", data_files=data_files)
ds
DatasetDict({
    DATA: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 750
    })
})
How can I split this ds into train and test?
Can I change the DATA label to TRAIN and TEST, each with the relevant samples?
2 Likes
Hello and welcome, @laro1!
You can use the train_test_split() function and specify the test_size parameter to determine the size of the split. It is a method of Dataset, so on a DatasetDict you select the split first. For example:
ds["DATA"].train_test_split(test_size=0.3)
DatasetDict({
    train: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 525
    })
    test: Dataset({
        features: ['premise', 'hypothesis', 'label'],
        num_rows: 225
    })
})
Check out the docs here and let me know if that helps!
9 Likes
Is there anything like the "stratify" param in scikit-learn?
(Or, more generally, a way to ensure class balance in the train and test splits?)
2 Likes
Yup, please check the stratify_by_column argument in the docs:
>>> from datasets import load_dataset
>>> ds = load_dataset("imdb", split="train")
>>> ds
Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})
>>> ds = ds.train_test_split(test_size=0.2, stratify_by_column="label")
>>> ds
DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 20000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 5000
    })
})
4 Likes
YanaS
December 22, 2022, 7:39am
5
When I load my custom dataset from a dictionary, I get an error:
ValueError: Stratifying by column is only supported for ClassLabel column, and column label is Sequence.
import pickle
import datasets

with open("/content/drive/MyDrive/all.bio.pickle", "rb") as f:
    bio_dict = pickle.load(f)
ds = datasets.Dataset.from_dict(bio_dict)
Dataset({
    features: ['id', 'text', 'ner_tags', 'input_ids', 'attention_mask', 'label'],
    num_rows: 8805
})
train_testvalid = ds.train_test_split(test_size=0.5, shuffle=True, stratify_by_column="label")
test_valid = train_testvalid["test"].train_test_split(test_size=0.5, shuffle=True, stratify_by_column="label")
ttv_ds = datasets.DatasetDict({
    "train": train_testvalid["train"],
    "validation": test_valid["train"],
    "test": test_valid["test"]})
Yes, this is an annoying error; it looks like they are using sklearn in the background.
One way to overcome this (as long as each label value has at least 2 members) is to cast the label column to a ClassLabel first:
# column we want to stratify with respect to
stratify_column_name = "label"
# create class label column and stratify
dataset.class_encode_column(
    stratify_column_name
).train_test_split(
    test_size=0.2,
    stratify_by_column=stratify_column_name
)
4 Likes