I came across this Pokemon BLIP dataset: lambdalabs/pokemon-blip-captions
The dataset table is just images of pokemon with a corresponding text input.
My dataset looks like this: oo92/diffusion-dataset
It's just URLs of images with corresponding text (and other entries).
I have the images downloaded from each URL. How can I format my dataset to look like that of Pokemon BLIP?
Hi! To render the images, the "url" column would have to be of type datasets.Image() instead of datasets.Value("string"). Unfortunately, changing the default types of datasets generated with the packaged loaders (csv is one of them) on the Hub is impossible at the moment. We plan to address this soon. In the meantime, you can get the same format by doing the following:
import datasets
from datasets import load_dataset
dset = load_dataset("oo92/diffusion-dataset") # reads csv data from the repo
dset = dset.rename_column("url", "image") # renames the "url" column to "image" (feel free to skip this step)
dset = dset.cast_column("image", datasets.Image()) # casts the "image" column from Value("string") to Image()
dset.push_to_hub("oo92/diffusion-dataset") # pushes the "fixed" dataset to the Hub as Parquet
Also, the code doesn't reference data.csv anywhere.
I've added some comments to my code to explain what it does.