Hello everyone
I’m handling a fairly big dataset locally and trying to save it with to_parquet() after some processing.
Besides some string/int columns, there is one column containing images.
I decided to follow these steps:
- Reading the data from a ZIP or folder into a pandas.DataFrame. The image column looks like {"bytes": …, "path": None} (a rough sketch of this step follows after the list).
- Creating a Dataset with from_pandas:
from datasets import Dataset, Features, Value, Image as DatasetImage

features = Features(
    {
        "image": DatasetImage(decode=False),  # keep the raw image bytes, don't decode to PIL
        "xml_content": Value("string"),
        "filename": Value("string"),
        "project_name": Value("string"),
    }
)
dataset = Dataset.from_pandas(df, preserve_index=False, features=features)
- (Optional) Splitting it before making it iterable, then iterating over train and test?
- Flattening, creating an IterableDataset, and running all the mapping:
dataset = dataset.flatten_indices()
dataset = dataset.to_iterable_dataset()
dataset = dataset.map(...)
The map step never puts encoded images into the dataset.
- Persisting with to_parquet() - that's it.
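For context, step 1 roughly looks like the sketch below. It is simplified: zip_to_dataframe, the extension filter, "data.zip" and the way xml_content is filled are just placeholders for what my real code does.

import zipfile

import pandas as pd


def zip_to_dataframe(zip_path: str, project_name: str) -> pd.DataFrame:
    # One row per image; the image cell has the dict shape that the
    # datasets Image feature understands: {"bytes": ..., "path": None}.
    rows = []
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if not name.lower().endswith((".png", ".jpg", ".jpeg")):
                continue
            rows.append(
                {
                    "image": {"bytes": zf.read(name), "path": None},
                    "xml_content": "",  # placeholder, the real code fills this from the annotations
                    "filename": name,
                    "project_name": project_name,
                }
            )
    return pd.DataFrame(rows)


df = zip_to_dataframe("data.zip", "my_project")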
My problem is the caching/handling of the dataset before making it iterable.
With a small dataset this is very fast, but I also have DataFrames with over 15,000 rows.
In those cases, from_pandas() runs out of memory (even with 64 GB RAM).
Any ideas and/or hints on how to handle this?
If I saved the images from the ZIP file to a temporary location on disk and let the Dataset load the images from there - would that help?
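To make that last idea concrete, something like the following sketch is what I have in mind (simplified to two columns; "tmp_images" and "data.zip" are placeholders, and I'm assuming the Image feature also accepts plain path strings):

import zipfile
from pathlib import Path

import pandas as pd
from datasets import Dataset, Features, Value, Image as DatasetImage

# Write the images to disk once and keep only their paths in the DataFrame,
# so the raw bytes don't all have to sit in pandas/Arrow memory at the same time.
extract_dir = Path("tmp_images")  # temporary directory, cleaned up afterwards
extract_dir.mkdir(exist_ok=True)

rows = []
with zipfile.ZipFile("data.zip") as zf:
    for name in zf.namelist():
        if not name.lower().endswith((".png", ".jpg", ".jpeg")):
            continue
        target = extract_dir / Path(name).name
        target.write_bytes(zf.read(name))
        rows.append({"image": str(target), "filename": name})

features = Features(
    {
        "image": DatasetImage(decode=False),  # should end up as {"bytes": None, "path": "..."}
        "filename": Value("string"),
    }
)
dataset = Dataset.from_pandas(pd.DataFrame(rows), preserve_index=False, features=features)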
Thanks in advance!