Saving train/val/test datasets

ThomasG · August 23, 2021, 10:37pm

Hi everyone.

After creating a dataset consisting of all my data, I split it into train/validation/test sets. I then perform a number of preprocessing steps on each of them and end up with three altered datasets, each of type datasets.arrow_dataset.Dataset.

To save them so that I can load the preprocessed datasets directly in the future, would I have to call

dataset.save_to_disk(FILE_PATH)

three times, once each for the training, validation, and test sets? Or is there a way to save them all together? If so, which is more efficient?

Thanks in advance.

lhoestq · August 25, 2021, 10:35am

Hi!

You can save them all as a dataset dictionary:

from datasets import DatasetDict, load_from_disk

dataset = DatasetDict({
    "train": train_dataset,
    "validation": validation_dataset,
    "test": test_dataset,
})

dataset.save_to_disk("path/to/dataset/dir")

# reload
dataset = load_from_disk("path/to/dataset/dir")

# access any split
train_dataset = dataset["train"]

This is especially useful for saving several splits of a dataset together.

ThomasG · August 25, 2021, 7:09pm

This is exactly what I was looking for!

Thanks a lot
