Hello everyone
I’m handling a fairly big dataset locally and trying to save it with to_parquet() after some processing.
Besides some string/int columns, there is one column containing images.
I decided to follow these steps:
- Reading the data from a ZIP or folder into a pandas.DataFrame. The image column looks like {"bytes": …, "path": None} (a rough sketch of this step follows after the list).
- Creating a Dataset with from_pandas:
from datasets import Dataset, Features, Value, Image as DatasetImage

features = Features(
    {
        "image": DatasetImage(decode=False),  # keep the raw image bytes, don't decode to PIL
        "xml_content": Value("string"),
        "filename": Value("string"),
        "project_name": Value("string"),
    }
)
dataset = Dataset.from_pandas(df, preserve_index=False, features=features)
- (Optional) Splitting it before making it iterable, then iterating over train and test?
- Flattening, creating an IterableDataset, and running all the mapping:
dataset = dataset.flatten_indices()
dataset = dataset.to_iterable_dataset()
dataset = dataset.map(...)
The map step never puts encoded images into the dataset.
- Persisting with to_parquet() - that's it.
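For context, step 1 roughly looks like the sketch below. It is simplified: zip_to_dataframe, the extension filter, "data.zip" and the way xml_content is filled are just placeholders for what my real code does.

import zipfile

import pandas as pd


def zip_to_dataframe(zip_path: str, project_name: str) -> pd.DataFrame:
    # One row per image; the image cell has the dict shape that the
    # datasets Image feature understands: {"bytes": ..., "path": None}.
    rows = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if not name.lower().endswith((".png", ".jpg", ".jpeg")):
                continue
            rows.append(
                {
                    "image": {"bytes": zf.read(name), "path": None},
                    "xml_content": "",  # placeholder, the real code fills this from the annotations
                    "filename": name,
                    "project_name": project_name,
                }
            )
    return pd.DataFrame(rows)


df = zip_to_dataframe("data.zip", "my_project")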
My problem is the caching/handling of the dataset before making it iterable.
With a small dataset this is very fast, but I also have DataFrames with over 15,000 rows.
In those cases, from_pandas() runs out of memory (even with 64 GB RAM).
Any ideas and/or hints on how to handle this?
If I saved the images from the ZIP file to a temporary location on disk and let the Dataset load the images from there - would that help?
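To make that last idea concrete, something like the following sketch is what I have in mind (simplified to two columns; "tmp_images" and "data.zip" are placeholders, and I'm assuming the Image feature also accepts plain path strings):

import zipfile
from pathlib import Path

import pandas as pd
from datasets import Dataset, Features, Value, Image as DatasetImage

# Write the images to disk once and keep only their paths in the DataFrame,
# so the raw bytes don't all have to sit in pandas/Arrow memory at the same time.
extract_dir = Path("tmp_images")  # temporary directory, cleaned up afterwards
extract_dir.mkdir(exist_ok=True)

rows = []
with zipfile.ZipFile("data.zip") as zf:
    for name in zf.namelist():
        if not name.lower().endswith((".png", ".jpg", ".jpeg")):
            continue
        target = extract_dir / Path(name).name
        target.write_bytes(zf.read(name))
        rows.append({"image": str(target), "filename": name})

features = Features(
    {
        "image": DatasetImage(decode=False),  # should end up as {"bytes": None, "path": "..."}
        "filename": Value("string"),
    }
)
dataset = Dataset.from_pandas(pd.DataFrame(rows), preserve_index=False, features=features)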
Thanks in advance!