My raw_dataset includes a few long documents:
DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 174
    })
    test: Dataset({
        features: ['text'],
        num_rows: 31
    })
})
I managed to save it with raw_dataset.save_to_disk(dataset_path). When I checked the folder structure, it looked as expected. The train and test subfolders were created, each containing three files: an Arrow data file, dataset_info.json, and state.json.
I do not understand why the two state.json files have the same _split value when one is for training and the other is for testing. The splits are recognized correctly in the dataset_dict.json file, which contains: {"splits": ["train", "test"]}
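To make the observation easy to reproduce, here is a minimal sketch that reads the saved files directly with the standard library. It assumes the on-disk layout described above (a dataset_dict.json at the top level and a state.json inside each split subfolder); the helper name read_split_values is hypothetical and is not part of the datasets library.

```python
import json
from pathlib import Path


def read_split_values(dataset_path):
    """Return {split folder name: "_split" value found in its state.json}.

    Hypothetical helper, assuming the folder layout described in this post.
    """
    root = Path(dataset_path)
    # dataset_dict.json lists the split subfolders, e.g. {"splits": ["train", "test"]}
    splits = json.loads((root / "dataset_dict.json").read_text())["splits"]
    values = {}
    for name in splits:
        state = json.loads((root / name / "state.json").read_text())
        # .get() returns None if the key is absent instead of raising KeyError
        values[name] = state.get("_split")
    return values
```

Calling read_split_values(dataset_path) on the saved folder shows, side by side, the folder name of each split and the _split value stored in its state.json.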
So my question is: what is the role of the _split property of the saved dataset if state.json cannot differentiate the splits?