Possible to stream and create new splits?

Say I want to train on a dataset such as togethercomputer/RedPajama-Data-V2, which only defines a train split. Is it possible to both stream a dataset and create new validation/train splits, similar to this example from transformers/examples/pytorch/language-modeling/run_clm_no_trainer.py in huggingface/transformers (which does not use streaming)? I've copied over the relevant bits.

    if args.dataset_name is not None:
        # Downloading and loading a dataset from the hub.
        raw_datasets = load_dataset(args.dataset_name, args.dataset_config_name)
        if "validation" not in raw_datasets.keys():
            raw_datasets["validation"] = load_dataset(
                args.dataset_name,
                args.dataset_config_name,
                split=f"train[:{args.validation_split_percentage}%]",
            )
            raw_datasets["train"] = load_dataset(
                args.dataset_name,
                args.dataset_config_name,
                split=f"train[{args.validation_split_percentage}%:]",
            )

This is not possible for this dataset.
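Percentage slicing aside, a count-based split may still be workable in streaming mode. As far as I know, the streamed `IterableDataset` in `datasets` has `take(n)` and `skip(n)` methods, so a rough sketch could look like the following. The helper name `head_tail_split` and the validation size are made up for illustration, and this is not something the example script does. Note that `skip` may still have to read through the skipped examples.

```python
def head_tail_split(streamed_dataset, num_validation_examples):
    """Split a streamed dataset by example count (hypothetical helper).

    Assumes `streamed_dataset` exposes `take(n)` and `skip(n)`, as the
    IterableDataset returned by load_dataset(..., streaming=True) does.
    """
    # The first N examples become the validation split.
    validation = streamed_dataset.take(num_validation_examples)
    # Everything after the first N examples becomes the train split.
    train = streamed_dataset.skip(num_validation_examples)
    return train, validation


# Possible usage (needs the `datasets` package and network access):
# from datasets import load_dataset
# streamed = load_dataset(
#     args.dataset_name, args.dataset_config_name, split="train", streaming=True
# )
# train, validation = head_tail_split(streamed, 1000)  # 1000 is an arbitrary choice
```

The split is by absolute count rather than percentage, because the total number of examples is not known up front when streaming.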

However, we plan to make it possible to load percentages of a dataset in streaming mode, for datasets made of multiple files (shards) and when the number of examples per file is available. This way, streaming can skip files until the right percentage of examples is reached.
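To illustrate the idea of skipping whole files, here is a minimal sketch in plain Python. It is not the library's implementation: the function name `split_shards_by_percentage` and the shard sizes are hypothetical, and it assumes the per-shard example counts are already known.

```python
from typing import List, Tuple


def split_shards_by_percentage(
    shard_sizes: List[int], validation_percentage: float
) -> Tuple[List[int], List[int]]:
    """Return (validation_shard_indices, train_shard_indices).

    Hypothetical helper: shards are assigned to validation, in order,
    until the requested percentage of examples has been reached.
    """
    total_examples = sum(shard_sizes)
    target = total_examples * validation_percentage / 100

    validation_shards: List[int] = []
    train_shards: List[int] = []
    examples_seen = 0
    for index, size in enumerate(shard_sizes):
        if examples_seen < target:
            # Still below the requested percentage: keep this shard for validation.
            validation_shards.append(index)
            examples_seen += size
        else:
            # Target reached: this shard and all later ones go to train.
            train_shards.append(index)
    return validation_shards, train_shards


# Made-up shard sizes; 30% of 1000 examples is 300.
print(split_shards_by_percentage([100, 200, 300, 400], 30))  # ([0, 1], [2, 3])
```

Because whole shards are assigned, the validation split can overshoot the requested percentage when the boundary falls inside a shard.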