When I load a dataset, filter it, and save the result via `save_to_disk`, like this:
```python
from datasets import load_dataset

# Load the dataset
dataset = load_dataset(input_dir)

# Filter the dataset down to the rows the user won
user_wins_dataset = dataset["train"].filter(lambda x: x["winner"] == "user")

# Save the filtered dataset... this saves the filtered-out rows, too :(
user_wins_dataset.save_to_disk(output_dir)
```
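For scale, this is how I'm measuring what actually lands on disk (standard library only; `output_dir` is the same placeholder as above):

```python
from pathlib import Path

# Total size of the Arrow shards that save_to_disk wrote
arrow_bytes = sum(f.stat().st_size for f in Path(output_dir).glob("*.arrow"))
print(f"{arrow_bytes / 1e9:.2f} GB on disk for {len(user_wins_dataset)} rows")
```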
The Arrow files it writes contain the entire dataset, including every row I filtered out. Because the dataset is massive, this severely slows down my data pipeline. Why isn't `datasets` saving only the filtered rows, as I would expect, and how can I make it do that?
When I load the saved dataset back, its `len()` is correct (24680, much reduced), but I notice that the `dataset_info.json` file still reports the original size (28608139).
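That number comes straight out of the saved metadata; I'm reading it like this (I believe the key is `size_in_bytes`, but treat that as an assumption, since it may differ across `datasets` versions):

```python
import json
from pathlib import Path

# Metadata written by save_to_disk alongside the Arrow files
info = json.loads((Path(output_dir) / "dataset_info.json").read_text())
print(info.get("size_in_bytes"))  # 28608139 here, i.e. the pre-filter size
```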
`flatten_indices` doesn't appear to help either:

```python
d2 = user_wins_dataset.flatten_indices(keep_in_memory=True)
d2.save_to_disk("flattened")  # Same size, all rows saved :(
```
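As a possible workaround, would round-tripping through Parquet be the way to go? My understanding is that the export methods apply the indices mapping, so only the kept rows should be written, but I'd prefer to stay with `save_to_disk` (untested sketch; the path is a placeholder):

```python
from datasets import load_dataset

# Export should materialize only the rows that survived the filter
user_wins_dataset.to_parquet("user_wins.parquet")

# Reload as a plain dataset with no indices mapping attached
reloaded = load_dataset("parquet", data_files="user_wins.parquet", split="train")
print(len(reloaded))  # expecting 24680
```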