Say I want to train on a dataset such as togethercomputer/RedPajama-Data-V2 · Datasets at Hugging Face, which only defines a train split. Is it possible to both stream the dataset and create new validation/train splits, similar to this example: transformers/examples/pytorch/language-modeling/run_clm_no_trainer.py at main · huggingface/transformers · GitHub (which does not use streaming)? I've copied over the relevant bits.
if args.dataset_name is not None:
    # Downloading and loading a dataset from the hub.
    raw_datasets = load_dataset(args.dataset_name, args.dataset_config_name)
    if "validation" not in raw_datasets.keys():
        raw_datasets["validation"] = load_dataset(
            args.dataset_name,
            args.dataset_config_name,
            split=f"train[:{args.validation_split_percentage}%]",
        )
        raw_datasets["train"] = load_dataset(
            args.dataset_name,
            args.dataset_config_name,
            split=f"train[{args.validation_split_percentage}%:]",
        )
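
For reference, here is a rough sketch of what I imagine the streaming equivalent could look like. It assumes the streamed split exposes `take(n)` and `skip(n)`, as `datasets.IterableDataset` does, and it splits by an absolute example count, since a streamed dataset has no known length to take a percentage of. The helper name `split_stream` and its parameter names are my own, not part of any library.

```python
def split_stream(stream, num_validation_examples):
    """Carve a validation set off the front of a streamed train split.

    `stream` is assumed to expose `take(n)` and `skip(n)`, as
    `datasets.IterableDataset` does. Both calls are lazy there, so
    nothing is downloaded until the returned splits are iterated.
    """
    if num_validation_examples < 0:
        raise ValueError("num_validation_examples must be non-negative")

    # The first N examples become the validation split ...
    validation = stream.take(num_validation_examples)
    # ... and everything after them becomes the new train split.
    train = stream.skip(num_validation_examples)

    return {"train": train, "validation": validation}
```

With the real library, I would expect the call to look something like `split_stream(load_dataset(args.dataset_name, args.dataset_config_name, split="train", streaming=True), 1000)`, where 1000 is an arbitrary validation size. If the stream is also shuffled, it is probably safest to split first and shuffle only the resulting train split, so the two splits cannot overlap.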