Embed Image Bytes with Sequence of Images Data

ydalmia · August 3, 2023, 3:42am

I am creating a dataset of multimodal documents, where each document contains a sequence of interleaved text and images. Each example in the dataset corresponds to a single document and has two entries: one for the paragraphs and one for the images.

For example, the text entry might look like [a, None, b, None] and the image entry might look like [None, img1, None, img2], so the two lists are aligned by position. This layout is inspired by OBELISC: HuggingFaceM4/OBELISC · Datasets at Hugging Face
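To make the layout concrete, here is a minimal sketch of how the two aligned lists describe one ordered document. The `interleave` helper is hypothetical (it is not part of `datasets`), the strings `"img1"` and `"img2"` stand in for real Image objects, and the rule that exactly one slot is filled at each position is an assumption based on the example above.

```python
def interleave(texts, images):
    """Merge two position-aligned lists into one ordered list of (kind, value) pairs."""
    if len(texts) != len(images):
        raise ValueError("texts and images must have the same length")
    merged = []
    for position, (text, image) in enumerate(zip(texts, images)):
        # Assumption: exactly one of the two slots is filled at each position
        if (text is None) == (image is None):
            raise ValueError(f"position {position} must hold either a text or an image")
        merged.append(("text", text) if text is not None else ("image", image))
    return merged


texts = ["a", None, "b", None]
images = [None, "img1", None, "img2"]  # placeholder strings instead of Image objects
document = interleave(texts, images)
```

Keeping two parallel lists, rather than one mixed list, lets each column have a single type, which is what a columnar format such as Parquet needs.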

How can I make the resulting parquet store the bytes of the Image object such that I can read the dataset back in later?

mariosasko · August 16, 2023, 3:29pm

Hi! You can run

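from datasets.table import embed_table_storage
...
# embed_table_storage expects a pyarrow Table, so switch to the Arrow format first
dataset = dataset.with_format("arrow")
dataset = dataset.map(embed_table_storage, batched=True)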

to embed the image data before saving the dataset as Parquet.
