Possible to stream and create new splits?

ivnle · December 27, 2023, 11:01pm

Say I want to train on a dataset such as togethercomputer/RedPajama-Data-V2, which only defines a train split. Is it possible to both stream a dataset and create new validation/train splits, similar to this example from transformers/examples/pytorch/language-modeling/run_clm_no_trainer.py in huggingface/transformers (which does not use streaming)? I’ve copied over the relevant bits.

    if args.dataset_name is not None:
        # Downloading and loading a dataset from the hub.
        raw_datasets = load_dataset(args.dataset_name, args.dataset_config_name)
        if "validation" not in raw_datasets.keys():
            raw_datasets["validation"] = load_dataset(
                args.dataset_name,
                args.dataset_config_name,
                split=f"train[:{args.validation_split_percentage}%]",
            )
            raw_datasets["train"] = load_dataset(
                args.dataset_name,
                args.dataset_config_name,
                split=f"train[{args.validation_split_percentage}%:]",
            )

lhoestq · January 4, 2024, 5:51pm

This is not possible for this dataset.

However, we plan to make it possible to load percentages of datasets in streaming mode for datasets made of multiple files (shards), provided the number of examples per file is available. That way, streaming can skip files until the right percentage of examples is reached.
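
To make the shard-skipping idea concrete, here is a rough sketch in plain Python. It is not the datasets library's implementation or API. The function name `split_shards_by_percentage` and the per-shard counts are hypothetical, and the sketch assumes the number of examples in each file is known up front.

```python
from typing import List, Tuple


def split_shards_by_percentage(
    examples_per_shard: List[int], validation_percentage: float
) -> Tuple[List[int], List[int]]:
    """Assign whole shards to validation until the target share is reached.

    Hypothetical helper, not part of any library.
    Returns (validation_shard_indices, train_shard_indices).
    """
    total = sum(examples_per_shard)
    target = total * validation_percentage / 100
    validation: List[int] = []
    train: List[int] = []
    seen = 0  # examples assigned to validation so far
    for index, count in enumerate(examples_per_shard):
        if seen < target:
            # Still below the requested share: this shard goes to validation.
            validation.append(index)
            seen += count
        else:
            # Target reached: every remaining shard goes to train.
            train.append(index)
    return validation, train


# Hypothetical counts: ten shards with 100 examples each.
counts = [100] * 10
val_shards, train_shards = split_shards_by_percentage(counts, 5)
```

Because whole files are assigned, the split is only as fine-grained as the shards: in the example above, asking for 5% puts the entire first shard (10% of the examples) in validation.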
