State.json does not reflect the split of the dataset

Rollerblade128 · October 25, 2023, 3:19pm

My raw_dataset includes a few long documents:

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 174
    })
    test: Dataset({
        features: ['text'],
        num_rows: 31
    })
})

I managed to save it with raw_dataset.save_to_disk(dataset_path). When I checked the folder structure, it looked as it should: train and test subfolders were created, and each of them contained 3 files: an Arrow data file, a dataset_info.json and a state.json.

I do not understand why the two state.json files have the same _split value when one is for the train split and the other is for the test split. The splits are correctly recognized in the dataset_dict.json file, which contains: {"splits": ["train", "test"]}
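
For anyone who wants to reproduce the comparison, here is a minimal sketch of how the _split values can be read back from disk with only the standard library. It assumes the layout described above (a dataset_dict.json at the top level and one state.json inside each split subfolder). The helper name read_split_values is made up for illustration and is not part of the datasets API.

```python
import json
from pathlib import Path


def read_split_values(dataset_path):
    """Return {split_name: _split value stored in that split's state.json}."""
    root = Path(dataset_path)

    # dataset_dict.json lists the split names, e.g. {"splits": ["train", "test"]}
    with open(root / "dataset_dict.json", encoding="utf-8") as f:
        split_names = json.load(f)["splits"]

    values = {}
    for name in split_names:
        # Each split lives in its own subfolder with its own state.json
        with open(root / name / "state.json", encoding="utf-8") as f:
            state = json.load(f)
        # .get() so that a missing key shows up as None instead of raising
        values[name] = state.get("_split")
    return values
```

Calling print(read_split_values(dataset_path)) on the saved folder should show one entry per split, which makes it easy to see whether the two state.json files really carry the same value.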

So my question is: what is the role of the _split property of the saved dataset if state.json cannot differentiate between the splits?
