How to split a dataset into train, test, and validation?

I am having difficulty figuring out how to split my dataset into train, test, and validation splits. I’ve been going through the documentation here:


and the template here:

but it hasn’t become any clearer.

this is the error I keep getting:
TypeError: 'NoneType' object is not callable

and this is the code I’m using:

def _split_generators(self, dl_manager):
    """Return one SplitGenerator per split (train / validation / test).

    Downloads the per-split title lists named in `_URLS`, then points every
    split at the single manually-downloaded data file; presumably
    `_generate_examples` filters rows of that file by `title_set` — confirm
    against the generator implementation.

    Args:
        dl_manager: `datasets.DownloadManager` used to fetch `_URLS` and to
            locate the user-supplied `manual_dir`.

    Returns:
        list[datasets.SplitGenerator]: generators for TRAIN, VALIDATION, TEST.

    Raises:
        FileNotFoundError: if the manually-downloaded file is absent from
            `manual_dir`.
    """
    # _URLS maps split name -> URL of a file listing the titles in that split.
    dl_path = dl_manager.download_and_extract(_URLS)
    titles = {k: set() for k in dl_path}
    for k, path in dl_path.items():
        with open(path, encoding="utf-8") as f:
            for line in f:
                titles[k].add(line.strip())

    # The raw dataset file itself must be downloaded manually by the user
    # and placed in `manual_dir` (passed as data_dir= to load_dataset).
    path_to_manual_file = os.path.join(
        os.path.abspath(os.path.expanduser(dl_manager.manual_dir)), self.config.filename
    )

    if not os.path.exists(path_to_manual_file):
        raise FileNotFoundError(
            "{} does not exist. Make sure you insert a manual dir via `datasets.load_dataset('wikihow', data_dir=...)` that includes a file name {}. Manual download instructions: {})".format(
                path_to_manual_file, self.config.filename, self.manual_download_instructions
            )
        )
    # All three splits read the same file; only the title filter differs.
    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            gen_kwargs={
                "path": path_to_manual_file,
                "title_set": titles["train"],
            },
        ),
        datasets.SplitGenerator(
            name=datasets.Split.VALIDATION,
            gen_kwargs={
                "path": path_to_manual_file,
                "title_set": titles["validation"],
            },
        ),
        datasets.SplitGenerator(
            name=datasets.Split.TEST,
            gen_kwargs={
                "path": path_to_manual_file,
                "title_set": titles["test"],
            },
        ),
    ]  # NOTE(review): this closing bracket was missing in the posted code (SyntaxError)

I think it’s answered here
How to split main dataset into train, dev, test as DatasetDict

2 Likes
from datasets import load_dataset, DatasetDict

# Load a dataset from Hugging Face (start from its single 'train' split)
dataset = load_dataset('squad', split='train')

# First carve off a 10% held-out portion, then split that portion in half,
# giving an 80 / 10 / 10 train / validation / test split overall.
train_val_split = dataset.train_test_split(test_size=0.1)
holdout_split = train_val_split['test'].train_test_split(test_size=0.5)

# Extract the three datasets
train_dataset = train_val_split['train']
val_dataset = holdout_split['train']
test_dataset = holdout_split['test']

# Bundle them into a single DatasetDict so they travel together
splits = DatasetDict(
    {'train': train_dataset, 'validation': val_dataset, 'test': test_dataset}
)

# Print the size of the datasets
print(f"Training set size: {len(train_dataset)}")
print(f"Validation set size: {len(val_dataset)}")
print(f"Test set size: {len(test_dataset)}")

# Save the datasets if needed
# splits.save_to_disk('path/to/dataset_dict')