Embed Image Bytes with Sequence of Images Data

I am creating a dataset of multimodal documents, where each document contains a sequence of interleaved text and images. Each example in the dataset corresponds to a single document and has two entries: one for the paragraphs and one for the images.

For example, the text entry might look like [a, None, b, None] and the image entry might look like [None, img1, None, img2], so that each position holds either a paragraph or an image. This is inspired by OBELISC (HuggingFaceM4/OBELISC on the Hugging Face Hub).

How can I make the resulting Parquet file store the bytes of each Image object, so that I can read the dataset back in later?

Hi! You can run

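from datasets.table import embed_table_storage
dataset = dataset.with_format("arrow")  # embed_table_storage expects a pyarrow.Table batch
dataset = dataset.map(embed_table_storage, batched=True)
dataset = dataset.with_format(None)  # switch back to the default Python format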

to embed the image data before saving the dataset as Parquet.