How to split main dataset into train, dev, test as DatasetDict

Hi Bram,
Yes, the documentation of train_test_split that you link to is the right one. The train_test_split method currently provided is essentially a copy of the well-known sklearn train_test_split (which we assume people are familiar with); we just removed the stratified split options, which are quite complex.
We could indeed add an option to split in three with a validation split; feel free to open a PR on this if you would like to have this feature quickly.
Right now, what you can do is split twice:

from datasets import DatasetDict

# 90% train, 10% (test + validation)
train_testvalid = dataset.train_test_split(test_size=0.1)
# Split the 10% (test + validation) in half test, half validation
test_valid = train_testvalid['test'].train_test_split(test_size=0.5)
# Gather everything into a single DatasetDict
train_test_valid_dataset = DatasetDict({
    'train': train_testvalid['train'],
    'test': test_valid['test'],
    'valid': test_valid['train']})
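
As a quick sanity check (just a sketch, assuming dataset was loaded with load_dataset and the split sizes above), you can look at the resulting splits:

for name, split in train_test_valid_dataset.items():
    print(name, split.num_rows)  # roughly 90% / 5% / 5% of the original rows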

The mention of a validation split that you point to is just an enum provided for dataset creators who would like to include a standard name for a validation split in their dataset.
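
If you are writing a dataset script yourself, here is a minimal sketch of how those standard split names are typically used in _split_generators (the class name, file names and feature schema below are placeholders, not taken from any real dataset):

import datasets

class MyDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features({"text": datasets.Value("string")})
        )

    def _split_generators(self, dl_manager):
        # Standard split names provided by the library
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"filepath": "train.txt"}),
            datasets.SplitGenerator(name=datasets.Split.VALIDATION, gen_kwargs={"filepath": "dev.txt"}),
            datasets.SplitGenerator(name=datasets.Split.TEST, gen_kwargs={"filepath": "test.txt"}),
        ]

    def _generate_examples(self, filepath):
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                yield idx, {"text": line.strip()}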

We will try to add more docs on the code organization, but bear in mind that (1) this library is still very young and (2) there are a lot fewer of us working on it (it’s really mostly one person, Quentin, that I try to help as much as I can), so it will definitely take some time before we “have clear, verbose documentation about as many aspects of the library as possible”.

Basically, to give you an idea, the code is organized in two main parts:

  1. the dataset building part, which is defined in part by the people writing dataset scripts and is very open (hence the many options for splits in this part) => this is most of the complex code because it is a wrapper around scripts provided externally. This includes files like builder.py, load.py, arrow_dataset.py.
  2. the dataset processing part (after the dataset has been built), which is mostly contained in the arrow_dataset.py file and contains most of what users will actually interact with => this is probably the part you need to read the most (see the short illustration after this list). The main complexity here is that we are deeply integrated with Apache Arrow, which is very efficient but definitely not the easiest framework to understand.
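
To illustrate the two parts in a couple of lines (a sketch using GLUE/MRPC purely as an example; any dataset would do):

from datasets import load_dataset

# Part 1 (building): load_dataset runs the dataset script and caches the data in Arrow format
dataset = load_dataset('glue', 'mrpc', split='train')

# Part 2 (processing): methods like map, filter and train_test_split, defined in arrow_dataset.py
dataset = dataset.map(lambda example: {'n_words': len(example['sentence1'].split())})
short = dataset.filter(lambda example: example['n_words'] < 20)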

You can also read this part in the doc where I tried to make a graph and give some information on how datasets are created (the first part in my list above): https://huggingface.co/docs/datasets/add_dataset.html
