How to create a dataset script for the LAION dataset

Hi.

I’ve been struggling with this for a while. I came across a notebook that works with the Pokemon BLIP captions dataset. I have a dataset in the same format: LAION Aesthetic 6pls.

From the Hugging Face docs, I got the impression that I’d have to write a dataset script like the one for the Food-101 dataset, but I couldn’t figure out how to apply this to LAION.

My goal is to upload the LAION dataset to Hugging Face the same way the Pokemon BLIP captions dataset was uploaded: one column of images and another column of captions.

I don’t know if a dataset script is even necessary for this but I’d like to know how to go about it.

Hi! You can just build your dataset in Python this way:

from datasets import load_dataset, Dataset, Image

texts = ["text of img0", "text of img1", ...]
image_paths = ["path_or_url/to/img0.png", "path_or_url/to/img1.png", ...]

ds = Dataset.from_dict({"image": image_paths, "text": texts})
ds = ds.cast_column("image", Image())
# the dataset is ready, and you can even share it on the Hugging Face Hub
ds.push_to_hub("oo92/my_dataset")
# and later
ds = load_dataset("oo92/my_dataset")
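
If your captions live in a CSV next to the downloaded images, a rough sketch for filling those two lists could look like the following. The folder layout, the `file_name` and `text` column names, and both helper functions (`collect_pairs`, `build_dataset`) are assumptions made for illustration; adapt them to however your LAION subset is actually stored.

```python
import csv
from pathlib import Path


def collect_pairs(image_dir, captions_csv):
    """Build the two columns from a captions CSV (hypothetical layout).

    Assumes the CSV has a `file_name` column (relative to image_dir)
    and a `text` column holding the caption.
    """
    image_dir = Path(image_dir)
    columns = {"image": [], "text": []}
    with open(captions_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            path = image_dir / row["file_name"]
            # Skip captions whose image is missing (e.g. a failed download),
            # so both columns stay the same length.
            if path.is_file():
                columns["image"].append(str(path))
                columns["text"].append(row["text"])
    return columns


def build_dataset(columns):
    # Imported lazily so collect_pairs works without `datasets` installed.
    from datasets import Dataset, Image

    ds = Dataset.from_dict(columns)
    # Casting turns the path strings into decoded images on access.
    return ds.cast_column("image", Image())
```

You would then call `build_dataset(collect_pairs("images", "captions.csv"))` and push the result to the Hub as above. Filtering out missing files first matters because `Dataset.from_dict` needs every column to have the same number of rows.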