AttributeError: 'DatasetDict' object has no attribute 'train_test_split'

Shouldn’t this work?

dataset = load_dataset('json', data_files='path/to/file')
dataset.train_test_split(test_size=0.15)

I’m getting the following error:

Using custom data configuration default

Downloading and preparing dataset json/default-cf892ee5bc3fc36a (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/json/default-cf892ee5bc3fc36a/0.0.0/70d89ed4db1394f028c651589fcab6d6b28dddcabbe39d3b21b4d41f9a708514...
Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-cf892ee5bc3fc36a/0.0.0/70d89ed4db1394f028c651589fcab6d6b28dddcabbe39d3b21b4d41f9a708514. Subsequent calls will reuse this data.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-3-59d55201b8c3> in <module>()
      1 dataset = load_dataset('json', data_files='/path/to/file')
----> 2 dataset.train_test_split(test_size=0.15)
      3 dataset.shard(10)
      4 dataset

AttributeError: 'DatasetDict' object has no attribute 'train_test_split'

Hi @thecity2, as far as I know train_test_split operates on Dataset objects, not DatasetDict objects.

For example, this works:

squad = (load_dataset('squad', split='train')
        .train_test_split(train_size=800, test_size=200))

because I’ve picked the train split and so load_dataset returns a Dataset object. On the other hand, this does not work:

squad = load_dataset('squad').train_test_split(train_size=800, test_size=200)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-10-d3fb264651eb> in <module>
----> 1 squad = load_dataset('squad').train_test_split(train_size=800, test_size=200)

AttributeError: 'DatasetDict' object has no attribute 'train_test_split'

Your load_dataset call does not specify a split, so it returns the latter (a DatasetDict). You could try applying train_test_split to one of the Dataset objects inside it, for example dataset['train'].


@lewtun You’re 100% correct. Thanks!


Alternatively, this also works:

squad = load_dataset('squad')['train'].train_test_split(train_size=800, test_size=200)

This worked in my case, thanks:

dataset = load_dataset('csv', split='train', data_files='train.csv')