Saving a dataset to disk after select copies the data

Hi,

As you can see in datasets/arrow_dataset.py at 2.0.0 · huggingface/datasets · GitHub, when you select indices from dataset A to create dataset B, B keeps the same underlying data as A. I guess this is the expected behavior, so I did not open an issue.
The problem is that when saving dataset B to disk, since the data of A was not filtered out, the whole data is saved to disk.

This is problematic in my use case: training and test splits (I am aware of the train_test_split method, but I need some specific sampling). Indeed, if I take my original dataset A, split it into TRAIN and TEST using select, and then save them with save_to_disk, my data is duplicated on disk: once in TRAIN and once in TEST.

Any idea on how to fix this, or what I am supposed to do to save a dataset after selection without copying the whole data?

Best,

Paul

Actually, it seems that train_test_split also uses select (datasets/arrow_dataset.py at 2.0.0 · huggingface/datasets · GitHub), so it must have the same problem?

Found a (not so satisfying) work-around: call d = d.filter(lambda x: True) before d.save_to_disk.

Hi! The same data shouldn't be saved twice, because save_to_disk calls flatten_indices to save only the selected rows (referenced by the _indices mapping) from the table. Why do you think that's not the case?

No idea :thinking:

$ datasets-cli env
- `datasets` version: 1.8.0
- Platform: Linux-4.18.0-305.40.2.el8_4.x86_64-x86_64-with-redhat-8.4-Ootpa
- Python version: 3.7.11
- PyArrow version: 3.0.0

Could you please try with the newest version of datasets and report back? It can be installed as follows:

pip install -U datasets

Even worse! With datasets 2.0.0, if I load the previously saved subset, it loads the whole dataset instead of only the selected rows.

Can you share the reproducer? Feel free to replace the original data with dummy data (to keep it private).

Never mind: I did not want to update because of the issue with the dataset that I had already saved, but updating to datasets 1.18.3 solves the issue (probably future versions as well; I tried 1.18.3 because it was already installed on another machine).

Thank you for your help!

See for instance:

In [2]: from datasets import Dataset

In [3]: d = Dataset.from_dict({'foo':[1]*10000})

In [4]: d
Out[4]: 
Dataset({
    features: ['foo'],
    num_rows: 10000
})

In [5]: d.save_to_disk('foo')

In [7]: ls -lh foo
total 88K
-rw-rw-r-- 1 lerner lerner 79K Apr  7 15:18 dataset.arrow
-rw-rw-r-- 1 lerner lerner 480 Apr  7 15:18 dataset_info.json
-rw-rw-r-- 1 lerner lerner 253 Apr  7 15:18 state.json

In [8]: d=d.select([0,1])

In [9]: d
Out[9]: 
Dataset({
    features: ['foo'],
    num_rows: 2
})

In [10]: d.save_to_disk('bar')
Flattening the indices: 100%|████████████████████████████████| 1/1 [00:00<00:00, 25.38ba/s]

In [11]: ls -lsh bar
total 12K
4.0K -rw-rw-r-- 1 lerner lerner 440 Apr  7 15:19 dataset.arrow
4.0K -rw-rw-r-- 1 lerner lerner 480 Apr  7 15:19 dataset_info.json
4.0K -rw-rw-r-- 1 lerner lerner 253 Apr  7 15:19 state.json