Datasets caching from_pandas()

Hello everyone
I’m handling a fairly large dataset locally and trying to save it with to_parquet() after some processing.
Besides some string/int columns, there is one column containing images.

I decided to follow these steps:

  1. Reading the data from a ZIP or folder into a pandas.DataFrame. The image column holds dicts like {"bytes": <raw bytes>, "path": None}.
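Roughly like this (simplified sketch - "data.zip", the extension filter and the placeholder values for the non-image columns are just for illustration, not my real code):

import zipfile
import pandas as pd

# Sketch of step 1: read every image from the ZIP into memory and build a
# DataFrame whose image column holds the encoded-image dicts.
rows = []
with zipfile.ZipFile("data.zip") as zf:
    for name in zf.namelist():
        if not name.lower().endswith((".png", ".jpg", ".jpeg")):
            continue
        rows.append(
            {
                "image": {"bytes": zf.read(name), "path": None},
                "xml_content": "...",        # comes from the real source in my code
                "filename": name,
                "project_name": "my_project",
            }
        )

df = pd.DataFrame(rows)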
  2. Creating a Dataset with from_pandas:
from datasets import Dataset, Features, Value
from datasets import Image as DatasetImage  # the Image feature type, aliased

features = Features(
    {
        "image": DatasetImage(decode=False),  # keep the images as encoded bytes
        "xml_content": Value("string"),
        "filename": Value("string"),
        "project_name": Value("string"),
    }
)

dataset = Dataset.from_pandas(df, preserve_index=False, features=features)
  3. (Optional) Splitting it before making it iterable, and then iterating over train and test separately?
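What I have in mind for that split is something like this (sketch; test_size and seed are arbitrary example values):

# Sketch of the optional split step
splits = dataset.train_test_split(test_size=0.1, seed=42)
train_ds, test_ds = splits["train"], splits["test"]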
  4. Flattening the indices, creating an IterableDataset, and running all the mapping:
dataset = dataset.flatten_indices()      # materialize any pending index mapping first
dataset = dataset.to_iterable_dataset()  # switch to streaming-style iteration
dataset = dataset.map(...)               # all the per-example processing

The mapping never writes encoded images back into the dataset.
  5. Persisting with to_parquet() - that’s it.

My problem is the caching/handling of the dataset before making it iterable.
With a small dataset it is very fast, but I also have some dataframes with over 15’000 rows.
In those cases, from_pandas() runs out of memory (even with 64 GB of RAM).

Any ideas and/or hints on how to handle this?
If I saved the images from the ZIP file temporarily to disk and let the Dataset load them from there - would that help?
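To make clear what I mean with that last idea - something like this sketch ("tmp_images", "data.zip", the extension filter and the reduced set of columns are only placeholders):

import zipfile
from pathlib import Path

import pandas as pd
from datasets import Dataset, Features, Value
from datasets import Image as DatasetImage

# Sketch of the disk-based variant: extract the images from the ZIP to a
# temporary folder and keep only the *paths* in the DataFrame, so the raw
# bytes never all sit in RAM at once.
tmp_dir = Path("tmp_images")
tmp_dir.mkdir(exist_ok=True)

rows = []
with zipfile.ZipFile("data.zip") as zf:
    for name in zf.namelist():
        if not name.lower().endswith((".png", ".jpg", ".jpeg")):
            continue
        target = tmp_dir / Path(name).name
        target.write_bytes(zf.read(name))
        rows.append({"image": str(target), "filename": name})

df = pd.DataFrame(rows)

features = Features({"image": DatasetImage(decode=False), "filename": Value("string")})
dataset = Dataset.from_pandas(df, preserve_index=False, features=features)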

Thanks in advance!