Uploading dataset compressing images

Hey! I’m having some trouble uploading a VQA dataset to the Hub via Python’s DatasetDict.push_to_hub(). Everything seems to work perfectly except the images, which seem to be converted from PNG to JPG.

The original dataset can be downloaded from here: Remote Sensing VQA - Low Resolution (RSVQA LR)

Here’s the dataset: exibings/rsvqa-lr · Datasets at Hugging Face

For comparison I was using the MMMU dataset: MMMU/MMMU · Datasets at Hugging Face, where the images are displayed as PNGs. (If you download an image directly from the dataset viewer, it downloads as a PNG; if you try the same with my dataset, you’ll notice that the images are low-res and download as JPGs.)

I’m unsure if it’s something I’m doing incorrectly when creating the DatasetDict that I upload.
This is the code snippet showing how I build each Dataset of the DatasetDict:

def process_split(questions: list, answers: list, split: str, dataset: Literal["lr", "hr"]) -> Dataset:
    records = {
        "type": [],
        "question": [],
        "img_id": [],
        "img": [],
        "answer": [],
    }

    for question, answer in zip(questions, answers):
        if question["active"] and answer["active"]:
            if question["type"] == "count" and dataset == "lr":
            elif question["type"] == "area" and dataset == "hr":

            records["img"].append(os.path.join("data", f"rsvqa-{dataset}", "images", f"{question['img_id']}.tif"))
    return Dataset.from_dict(records, split=split).cast_column("img", Image())

Hi! The Hugging Face Hub may convert the images to another format to show them in your browser.

Under the hood, the images in your dataset are still in their original format: datasets doesn’t convert images when pushing to the Hub.
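
If you want to check this yourself, here is a minimal sketch that looks at the raw stored bytes instead of the viewer preview. It assumes the image column is called "img" (as in the snippet above) and that the images are embedded as bytes in the uploaded files; sniff_image_format and stored_formats are hypothetical helper names, not part of datasets, and the split name is something you need to fill in for your repo.

```python
def sniff_image_format(data: bytes) -> str:
    """Guess the image container format from its leading magic bytes."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "PNG"
    if data.startswith(b"\xff\xd8\xff"):
        return "JPEG"
    if data[:4] in (b"II*\x00", b"MM\x00*"):  # little- and big-endian TIFF
        return "TIFF"
    return "unknown"


def stored_formats(repo_id: str, split: str, column: str = "img", limit: int = 5) -> list[str]:
    """Report the formats of the first few stored images without decoding them."""
    # Imported lazily because this part needs the datasets library and network access.
    from datasets import Image, load_dataset

    ds = load_dataset(repo_id, split=split, streaming=True)
    # decode=False returns {"bytes": ..., "path": ...} instead of a PIL image,
    # so we see exactly what is stored on the Hub.
    ds = ds.cast_column(column, Image(decode=False))

    formats = []
    for row in ds.take(limit):
        cell = row[column]
        if cell["bytes"] is not None:
            formats.append(sniff_image_format(cell["bytes"]))
        else:
            # Assumption: no embedded bytes, so only a file path is available.
            formats.append(f"path only: {cell['path']}")
    return formats
```

For example, stored_formats("exibings/rsvqa-lr", split="train") would list what is actually stored, assuming the repo has a split named "train". If the source files were .tif, I would expect to see TIFF there rather than JPEG, but I haven't run this against your repo.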

Oh ok, maybe I should have tried to actually download it and check :sweat_smile::sweat_smile:

Anyhow, just for the sake of learning: how come the MMMU dataset images are presented in .png format in the browser? My guess was that it might be because of the auto-parquet conversion (or lack thereof in their case). Not sure if that’s the case, but I would love to know more!

Probably because the MMMU dataset was already in PNG; in the browser we try to use either JPEG or PNG.