Hi!
I am working with a wikipedia dataset and want to distribute it across my machines. It is preprocessed roughly like this:
import datasets

TEST_SIZE = 10_000
VAL_SIZE = 3_000

# map_fn is my preprocessing function (defined elsewhere)
wiki_processed = datasets.load_dataset("wikipedia", "20200501.en")["train"]
wiki_processed = wiki_processed.map(map_fn, num_proc=20)
wiki_processed = wiki_processed.filter(lambda x: bool(x["text"]), num_proc=20)

dataset_dict_val = wiki_processed.train_test_split(test_size=VAL_SIZE, seed=42)
dataset_dict = dataset_dict_val["train"].train_test_split(test_size=TEST_SIZE, seed=42)
dataset_dict["validation"] = dataset_dict_val["test"]
dataset_dict.save_to_disk("../data/wikipedia_rank")
Originally, wikipedia takes about 16GB, but wikipedia_rank is more than 90GB.
This happens because every subset (train, validation, test) has the original dataset.arrow in its directory, plus cache files.
~/d/c/d/wikipedia_rank ❯❯❯ du -hs *
4.0K dataset_dict.json
16G test
61G train
16G validation
~/d/c/d/wikipedia_rank ❯❯❯ du -hs test/*
4.3M test/cache-0924ed04d225940e.arrow
4.1M test/cache-0ca033cdbd106a2d.arrow
4.3M test/cache-1a97b84558ed7b1d.arrow
4.3M test/cache-22a7c31dbee9c03c.arrow
80K test/cache-2bb6f88c5c8f4329.arrow
4.4M test/cache-2f128b0e055ed3cf.arrow
4.4M test/cache-3c97985cec531e1a.arrow
4.4M test/cache-482dc3ba3be09fbc.arrow
4.0M test/cache-4c932850df42ad39.arrow
4.7M test/cache-4ce30cf00c698ef6.arrow
4.7M test/cache-4e36968451cc71ba.arrow
4.5M test/cache-550a1d3d28c3487a.arrow
4.9M test/cache-599e1bb413fe3195.arrow
4.2M test/cache-637a2658f481ab57.arrow
4.7M test/cache-7c27537563c55dd1.arrow
5.1M test/cache-924c2f977e10633f.arrow
4.7M test/cache-9510e0cd305aa3fa.arrow
4.5M test/cache-9664e702137250f3.arrow
4.4M test/cache-9c3a9b369f101759.arrow
4.1M test/cache-ab0b7947030b6ce7.arrow
4.0M test/cache-e124027841f7ed5b.arrow
16G test/dataset.arrow # I want test/dataset.arrow only to hold 10K examples, not the whole wikipedia
200K test/dataset_info.json
4.0K test/state.json
What is the recommended way to only save a sampled and preprocessed version of my dataset without cache files?
UPD: this workaround is not feasible in my case even with 1TB of RAM =(
Here’s a workaround that I figured out, but it is slow and requires putting all of your data into RAM (and I don’t have enough).
dataset_dict = datasets.DatasetDict({
    "train": datasets.arrow_dataset.Dataset.from_dict(dataset_dict["train"].data.to_pydict()),
    "validation": datasets.arrow_dataset.Dataset.from_dict(dataset_dict["validation"].data.to_pydict()),
    "test": datasets.arrow_dataset.Dataset.from_dict(dataset_dict["test"].data.to_pydict()),
})
dataset_dict.save_to_disk("../data/wikipedia_rank_nocache")
Possibly clearing cache before saving would work?
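Something like this is what I have in mind (a rough sketch; as far as I know cleanup_cache_files() only removes the cache-*.arrow files, so I’m not sure whether it also shrinks dataset.arrow):
# Rough sketch: drop each split's cache-*.arrow files before saving.
for name in dataset_dict:
    n_removed = dataset_dict[name].cleanup_cache_files()
    print(name, "- removed", n_removed, "cache files")
dataset_dict.save_to_disk("../data/wikipedia_rank_nocache")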
you can disable caching entirely by using:
import datasets
datasets.set_caching_enabled(False)
Thank you adilism and lewtun,
I tried both suggestions, but they did not work out for me.
Neither cleanup_cache_files() nor datasets.set_caching_enabled(False) seems to affect saving at all, which is curious. I still have 16GB for each of train, validation, and test in my wikipedia_rank_nocache folders.
The thing that kind of worked is to only create new ArrowDataset objects for validation and test. These two are small and fit into memory easily.
dataset_dict["test"] = datasets.arrow_dataset.Dataset.from_dict(dataset_dict_val["test"].data.to_pydict())
dataset_dict["validation"] = datasets.arrow_dataset.Dataset.from_dict(dataset_dict_val["validation"].data.to_pydict())
dataset_dict.save_to_disk("../data/wikipedia_rank_nocache")
I had issues with caches as well; try calling flatten_indices before saving the dataset.
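For example, something like this (a quick sketch; the output path is just an example):
import datasets

# Materialize each split so its dataset.arrow only contains the selected rows,
# then save the flattened copies.
flat_dict = datasets.DatasetDict(
    {name: ds.flatten_indices() for name, ds in dataset_dict.items()}
)
flat_dict.save_to_disk("../data/wikipedia_rank_flat")  # example path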
You are welcome @dropout05. Also, @lhoestq please take note - I didn’t file this issue because a) it was transient across releases, and b) it was minor and I found a workaround.
Indeed, currently if you slice the dataset in some way (using shard, train_test_split or select for example), then under the hood the actual dataset isn’t changed; instead an indices mapping is added to avoid having to rewrite a new arrow Table (saving time + disk/memory usage). It maps the indices used by __getitem__ to the right rows of the arrow Table.
By default save_to_disk does save the full dataset table + the mapping.
If you want to only save the shard of the dataset instead of the original arrow file + the indices, then you have to call flatten_indices first. It creates a new arrow table by using the right rows of the original table.
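To make this concrete, here’s a quick sketch (assuming a recent version of datasets, where .data exposes the underlying Arrow table):
test_split = dataset_dict["test"]
print(len(test_split))           # 10_000 - length follows the indices mapping
print(test_split.data.num_rows)  # the full underlying table, much more than 10_000

flat_test = test_split.flatten_indices()  # rewrites a new table with only those rows
print(flat_test.data.num_rows)            # 10_000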
The current documentation is missing this, let me update it.
Update, since this thread still has some views:
flatten_indices is now called before saving the dataset to disk by default, in order to avoid saving the full dataset if there’s only a subset that needs to be written (after calling select or shard for example).