Saving dataset in the current state without cache

Hi!

I am working with a wikipedia dataset and want to distribute it across my machines. It is preprocessed roughly like this:

import datasets

TEST_SIZE = 10_000
VAL_SIZE = 3_000

wiki_processed = datasets.load_dataset("wikipedia", "20200501.en")["train"]
wiki_processed = wiki_processed.map(map_fn, num_proc=20)
wiki_processed = wiki_processed.filter(lambda x: bool(x["text"]), num_proc=20)

dataset_dict_val = wiki_processed.train_test_split(test_size=VAL_SIZE, seed=42)
dataset_dict = dataset_dict_val["train"].train_test_split(test_size=TEST_SIZE, seed=42)
dataset_dict["validation"] = dataset_dict_val["test"]

dataset_dict.save_to_disk("../data/wikipedia_rank")

Originally, the wikipedia dataset takes about 16 GB, but wikipedia_rank is more than 90 GB.
This happens because every split (train, validation, test) contains the original dataset.arrow in its directory plus the cache files.

~/d/c/d/wikipedia_rank ❯❯❯ du -hs *
4.0K	dataset_dict.json
16G	test
61G	train
16G	validation
~/d/c/d/wikipedia_rank ❯❯❯ du -hs test/*
4.3M	test/cache-0924ed04d225940e.arrow
4.1M	test/cache-0ca033cdbd106a2d.arrow
4.3M	test/cache-1a97b84558ed7b1d.arrow
4.3M	test/cache-22a7c31dbee9c03c.arrow
80K	test/cache-2bb6f88c5c8f4329.arrow
4.4M	test/cache-2f128b0e055ed3cf.arrow
4.4M	test/cache-3c97985cec531e1a.arrow
4.4M	test/cache-482dc3ba3be09fbc.arrow
4.0M	test/cache-4c932850df42ad39.arrow
4.7M	test/cache-4ce30cf00c698ef6.arrow
4.7M	test/cache-4e36968451cc71ba.arrow
4.5M	test/cache-550a1d3d28c3487a.arrow
4.9M	test/cache-599e1bb413fe3195.arrow
4.2M	test/cache-637a2658f481ab57.arrow
4.7M	test/cache-7c27537563c55dd1.arrow
5.1M	test/cache-924c2f977e10633f.arrow
4.7M	test/cache-9510e0cd305aa3fa.arrow
4.5M	test/cache-9664e702137250f3.arrow
4.4M	test/cache-9c3a9b369f101759.arrow
4.1M	test/cache-ab0b7947030b6ce7.arrow
4.0M	test/cache-e124027841f7ed5b.arrow
16G	test/dataset.arrow  # I want test/dataset.arrow to hold only 10K examples, not the whole of wikipedia
200K	test/dataset_info.json
4.0K	test/state.json

What is the recommended way to save only the sampled and preprocessed version of my dataset, without the cache files?

UPD: this workaround is not feasible in my case even with 1 TB of RAM =(

Here’s a workaround that I figured out, but it is slow and requires putting all of your data into RAM (and I don’t have enough).

dataset_dict = datasets.DatasetDict({
    "train": datasets.arrow_dataset.Dataset.from_dict(dataset_dict["train"].data.to_pydict()),
    "validation": datasets.arrow_dataset.Dataset.from_dict(dataset_dict["validation"].data.to_pydict()),
    "test": datasets.arrow_dataset.Dataset.from_dict(dataset_dict["test"].data.to_pydict()),
})

dataset_dict.save_to_disk("../data/wikipedia_rank_nocache")

Possibly clearing the cache files (with cleanup_cache_files()) before saving would work?
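
For reference, applied to the DatasetDict from the first post that suggestion would look roughly like this (just a sketch; cleanup_cache_files() only deletes the cache-*.arrow files produced by map and filter, it does not touch the dataset table itself):

for split in dataset_dict.values():
    split.cleanup_cache_files()

dataset_dict.save_to_disk("../data/wikipedia_rank_nocache")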

You can disable caching entirely by using:

import datasets

datasets.set_caching_enabled(False)

Thank you adilism and lewtun,

I tried both suggestions, but they did not work out for me :disappointed:.

Neither cleanup_cache_files() nor datasets.set_caching_enabled(False) seems to affect saving at all, which is curious. I still get 16 GB for each of train, validation, and test in my wikipedia_rank_nocache folder.

The thing that kind of worked is to only create new Dataset objects for the validation and test splits. These two are small and fit into memory easily.

dataset_dict["test"] = datasets.arrow_dataset.Dataset.from_dict(dataset_dict_val["test"].data.to_pydict())
dataset_dict["validation"] = datasets.arrow_dataset.Dataset.from_dict(dataset_dict_val["validation"].data.to_pydict())
dataset_dict.save_to_disk("../data/wikipedia_rank_nocache")

I had issues with caches as well; try calling flatten_indices() before saving the dataset.
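
For the DatasetDict above, a minimal sketch would be something like this (flatten_indices() returns a new dataset, so each split has to be reassigned; the output path is just an example):

dataset_dict = datasets.DatasetDict({
    name: split.flatten_indices() for name, split in dataset_dict.items()
})
dataset_dict.save_to_disk("../data/wikipedia_rank_flat")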

That worked! Thank you!

You are welcome @dropout05. Also, @lhoestq please take note - I didn’t file this as an issue because a) it was transient across releases, and b) it was minor and I found a workaround.

Indeed, currently if you slice the dataset in some way (using shard, train_test_split, or select for example), then under the hood the actual dataset isn’t changed; instead, an indices mapping is added to avoid having to write a new Arrow table (which saves time and disk/memory usage). It maps the indices used by __getitem__ to the right rows of the Arrow table.

By default, save_to_disk saves the full dataset table plus the indices mapping.

If you want to save only the selected shard of the dataset instead of the original Arrow file plus the indices, then you have to call flatten_indices first. It creates a new Arrow table containing only the right rows of the original table.
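
As a rough illustration of the indices mapping (using the wiki_processed dataset from the first post; dataset.data is the underlying arrow table, and the exact attributes may vary a bit between versions):

subset = wiki_processed.select(range(10_000))  # adds an indices mapping, doesn't rewrite the table
print(len(subset))           # 10000
print(subset.data.num_rows)  # still the number of rows of the full table

flat = subset.flatten_indices()  # materializes the 10k selected rows into a new arrow table
print(flat.data.num_rows)    # 10000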

The current documentation is missing this, let me update it.

Update, since this thread still has some views:

Now flatten_indices is called by default before saving the dataset to disk, in order to avoid saving the full dataset when only a subset needs to be written (after calling select or shard for example).
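
In other words, with a recent version of datasets something like this should write only the selected rows to disk, without calling flatten_indices yourself (a sketch; the path is just an example):

subset = wiki_processed.select(range(10_000))
subset.save_to_disk("../data/wikipedia_subset")  # only the 10k selected rows are written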