How to change the format of a dataset


I came across this Pokemon BLIP dataset: lambdalabs/pokemon-blip-captions · Datasets at Hugging Face

The dataset table is just images of pokemon with a corresponding text input.

My dataset looks like this: oo92/diffusion-dataset · Datasets at Hugging Face

It's just URLs of images with corresponding text (and other entries).

I have the images downloaded from each URL. How can I format my dataset to look like that of Pokemon BLIP?

Hi! To render the images, the “url” column would have to be of type datasets.Image() instead of datasets.Value("string"). Unfortunately, changing the default types of the datasets generated with the packaged loaders (csv is one of them) on the Hub is impossible at the moment. We plan to address this soon. In the meantime, you can get the same format by doing the following:

import datasets
from datasets import load_dataset

dset = load_dataset("oo92/diffusion-dataset")  # reads csv data from the repo
dset = dset.rename_column("url", "image")  # renames the "url" column to "image" (feel free to skip this step)
dset = dset.cast_column("image", datasets.Image())  # casts the "image" column from Value("string") to Image()
dset.push_to_hub("oo92/diffusion-dataset")  # pushes the "fixed" dataset to the Hub as Parquet

Also, the code makes no reference to data.csv.

I've added some comments to my code to explain what it does.
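
Since the images are already downloaded, the same cast should also work on local file paths instead of URLs. Here is a rough sketch of that variant, not a tested recipe for this particular repo. It assumes the files were saved in a hypothetical images/ directory under the last path segment of each URL; url_to_local_path and use_local_images are made-up helper names, so adjust them to however the files were actually named.

```python
from pathlib import Path
from urllib.parse import unquote, urlparse


def url_to_local_path(url, image_dir="images"):
    """Map an image URL to the local file assumed to hold its download.

    Assumption: each file was saved as <image_dir>/<last URL path segment>.
    """
    filename = Path(unquote(urlparse(url).path)).name
    return str(Path(image_dir) / filename)


def use_local_images(dset, image_dir="images"):
    """Add an "image" column of local paths and cast it to Image()."""
    import datasets  # imported here so the path helper works without the library

    # Build the new column from the existing "url" column (assumed column name).
    dset = dset.map(
        lambda row: {"image": url_to_local_path(row["url"], image_dir)}
    )
    # Casting a column of path strings to Image() makes the rows decode as images.
    return dset.cast_column("image", datasets.Image())
```

You would call use_local_images(dset) on the result of load_dataset(...) and then push_to_hub as before. The query string of the URL is ignored when building the filename, so two URLs that differ only in their parameters would map to the same file.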