Saving a dataset to disk after select copies the data


As you can see in datasets/ at 2.0.0 · huggingface/datasets · GitHub, when selecting indices from dataset A for dataset B, B keeps the same data as A. I guess this is the expected behavior, so I did not open an issue.
The problem is that when saving dataset B to disk, the data of A has not been filtered, so the whole data is saved to disk.

This is problematic in my use case: training and test splits (I am aware of the train_test_split method, but I need some specific sampling). Indeed, if I have my original dataset A and I split it into TRAIN and TEST using select, then save them using save_to_disk, my data will be duplicated: it ends up in both TRAIN and TEST.

Any idea how to fix this, or what I am supposed to do to save a dataset after selection without copying the whole data?



Actually, it seems that train_test_split also uses select (datasets/ at 2.0.0 · huggingface/datasets · GitHub), so it must have the same problem?

Found a (not so satisfying) work-around: d = d.filter(lambda x: True) before d.save_to_disk

Hi! The same data shouldn't be saved twice because save_to_disk calls flatten_indices to save only the selected rows (referenced by the _indices mapping) from the table. Why do you think that's not the case?

No idea :thinking:

$ datasets-cli env
- `datasets` version: 1.8.0
- Platform: Linux-4.18.0-305.40.2.el8_4.x86_64-x86_64-with-redhat-8.4-Ootpa
- Python version: 3.7.11
- PyArrow version: 3.0.0

Could you please try with the newest version of datasets and report back? It can be installed as follows:

pip install -U datasets

Even worse! With datasets 2.0.0, if I load the previously saved subset, it loads the whole dataset instead of the selected indices.

Can you share the reproducer? Feel free to replace the original data with dummy data (to keep it private).

Never mind, I did not want to update because of the issue on the dataset that I had already saved, but updating to datasets 1.18.3 solves the issue (probably future versions as well, I tried 1.18.3 because it was already installed on another machine).

Thank you for your help!

See for instance:

In [2]: from datasets import Dataset

In [3]: d = Dataset.from_dict({'foo':[1]*10000})

In [4]: d
Out[4]:
Dataset({
    features: ['foo'],
    num_rows: 10000
})

In [5]: d.save_to_disk('foo')

In [7]: ls -lh foo
total 88K
-rw-rw-r-- 1 lerner lerner 79K avril  7 15:18 dataset.arrow
-rw-rw-r-- 1 lerner lerner 480 avril  7 15:18 dataset_info.json
-rw-rw-r-- 1 lerner lerner 253 avril  7 15:18 state.json

In [8]: d = d.select([0,1])

In [9]: d
Out[9]:
Dataset({
    features: ['foo'],
    num_rows: 2
})

In [10]: d.save_to_disk('bar')
Flattening the indices: 100%|██████████| 1/1 [00:00<00:00, 25.38ba/s]

In [11]: ls -lsh bar
total 12K
4,0K -rw-rw-r-- 1 lerner lerner 440 avril  7 15:19 dataset.arrow
4,0K -rw-rw-r-- 1 lerner lerner 480 avril  7 15:19 dataset_info.json
4,0K -rw-rw-r-- 1 lerner lerner 253 avril  7 15:19 state.json