How do I make a dataset for vision models?

marksuccsmfewercoc · April 15, 2024, 3:28pm

So I was reading the blog post Vision Language Models Explained on fine-tuning LLaVA with a dataset from HF, and I was wondering: how can I upload my custom dataset to HF to do the same with my data? Can anyone help me out with that?

Currently, I have an images folder that contains all the images and a file named data.json in my root folder. What would the process look like? I have already read some tutorials in the docs.

data.json:

[
  {
    "role": "user",
    "content": [
      { "type": "text", "text": "What’s in this image?" },
      { "type": "image" }
    ]
  },
  { "role": "assistant", "content": [{ "type": "text", "text": "DOG" }] },
  {
    "role": "user",
    "content": [
      { "type": "text", "text": "What’s in this image?" },
      { "type": "image" }
    ]
  },
  { "role": "assistant", "content": [{ "type": "text", "text": "CAT" }] },
  {
    "role": "user",
    "content": [
      { "type": "text", "text": "What’s in this image?" },
      { "type": "image" }
    ]
  },
  { "role": "assistant", "content": [{ "type": "text", "text": "DUCK" }] }
]

marksuccsmfewercoc · April 16, 2024, 5:41am

Any idea? @lhoestq @mariosasko

lhoestq · April 16, 2024, 10:06am

In the blog they use the dataset at HuggingFaceH4/llava-instruct-mix-vsft · Datasets at Hugging Face, so you should use the same format. I notice you already have the JSON part in the right format; you just need to associate the JSON part with the images.

You can do so programmatically:

from datasets import Dataset, Image, Sequence

ds = Dataset.from_dict({"messages": your_data_json_messages_list})
ds = ds.add_column("images", [[img_path] for img_path in img_paths])  # contains lists of 1 image
ds = ds.cast_column("images", Sequence(Image()))  # from string type to image type

# Optionally push to the HF Hub
ds.push_to_hub("username/datasetname")

marksuccsmfewercoc · April 16, 2024, 10:29am

Getting this error “ValueError: Failed to concatenate on axis=1 because tables don’t have the same number of rows”

import os
import json
from datasets import Dataset, Image, Sequence

with open("data.json", "r") as f:
    your_data_json = json.load(f)

img_folder = "images/"
img_paths = [os.path.join(img_folder, img_name) for img_name in os.listdir(img_folder) if img_name.endswith(('.jpg', '.jpeg', '.png'))]

ds = Dataset.from_dict({"messages": your_data_json})

ds = ds.add_column("images", [[img_path] for img_path in img_paths])  

ds = ds.cast_column("images", Sequence(Image()))

ds.push_to_hub("marksuccsmfewercoc/test")

lhoestq · April 16, 2024, 10:52am

You should have as many images as items in your messages list for it to work. In particular you can make sure each item in the messages list is a list of one user message and one assistant message.
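A minimal sketch of that regrouping, assuming data.json is a flat list that alternates user and assistant messages as in the first post. The helper name `group_into_conversations` is made up for illustration and is not part of the datasets library:

```python
def group_into_conversations(flat_messages):
    """Pair each user message with the assistant message that follows it."""
    if len(flat_messages) % 2 != 0:
        raise ValueError("expected an even number of messages (user/assistant pairs)")
    conversations = []
    for i in range(0, len(flat_messages), 2):
        user, assistant = flat_messages[i], flat_messages[i + 1]
        if user["role"] != "user" or assistant["role"] != "assistant":
            raise ValueError(f"messages {i} and {i + 1} are not a user/assistant pair")
        conversations.append([user, assistant])
    return conversations


# Stand-in for the parsed contents of data.json (two of the three pairs).
question = [{"type": "text", "text": "What's in this image?"}, {"type": "image"}]
flat = [
    {"role": "user", "content": question},
    {"role": "assistant", "content": [{"type": "text", "text": "DOG"}]},
    {"role": "user", "content": question},
    {"role": "assistant", "content": [{"type": "text", "text": "CAT"}]},
]

conversations = group_into_conversations(flat)
print(len(conversations))  # 2
```

With this shape, the messages list has one item per image, so its length can match the length of the images column.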

marksuccsmfewercoc · April 16, 2024, 11:14am

But the LLaVA dataset and my dataset look very different. How do I fix that?

Here is what the correct dataset looks like: HuggingFaceH4/llava-instruct-mix-vsft · Datasets at Hugging Face, and here is what my dataset looks like: marksuccsmfewercoc/test · Datasets at Hugging Face

lhoestq · April 16, 2024, 11:41am

The messages column should contain lists (one list per image), for example the first list can look like

[
  {
    "content": [ { "index": null, "text": "What’s in this image?\n", "type": "text" }, { "index": 0, "text": null, "type": "image" } ],
    "role": "user"
  },
  {
    "content": [ { "index": null, "text": "DOG", "type": "text" } ],
    "role": "assistant" 
  }
]
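As a rough sketch, a conversation written in the shorter style from the first post could be converted to this shape by giving every content item the same three keys. The helper name `normalize_conversation` is hypothetical, and treating `index` as the position of the image in that row's images list is an assumption based on the example above:

```python
def normalize_conversation(conversation):
    """Give every content item the same three keys: index, text, type."""
    image_index = 0  # assumption: position of the image in this row's "images" list
    normalized = []
    for message in conversation:
        content = []
        for item in message["content"]:
            if item["type"] == "image":
                content.append({"index": image_index, "text": None, "type": "image"})
                image_index += 1
            else:
                content.append({"index": None, "text": item["text"], "type": "text"})
        normalized.append({"content": content, "role": message["role"]})
    return normalized


conversation = [
    {"role": "user", "content": [{"type": "text", "text": "What's in this image?\n"}, {"type": "image"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "DOG"}]},
]

row = normalize_conversation(conversation)
print(row[0]["content"][1])  # {'index': 0, 'text': None, 'type': 'image'}
```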

marksuccsmfewercoc · April 16, 2024, 11:58am

So what should my JSON file look like before I upload it to the Hub?

marksuccsmfewercoc · April 16, 2024, 3:39pm

Can you please tell me what’s the correct format to make it work with fine-tuning LLaVA according to the blog by HF?

I just have three images in my images folder.

marksuccsmfewercoc · April 17, 2024, 11:51am

Hey @lhoestq, I’m still getting “ValueError: Failed to concatenate on axis=1 because tables don’t have the same number of rows”.

But when I print the number of images and the length of the JSON, it prints 3 and 3, which is correct. Any idea what the issue could be?


import os
import json
from datasets import Dataset, Image, Sequence

with open("data.json", "r") as f:
    your_data_json = json.load(f)

print(len(your_data_json))
print(len(os.listdir('images/')))

ds = Dataset.from_dict({"messages": your_data_json})
ds = ds.add_column("images", [[img_path] for img_path in 'images/']) 
ds = ds.cast_column("images", Sequence(Image()))  

ds.push_to_hub("marksuccsmfewercoc/test")

I don’t understand: if both len(your_data_json) and len(os.listdir('images/')) are 3, then what’s the issue?

lhoestq · April 19, 2024, 8:48am

Hi! I think there is a small bug in your code:

- ds = ds.add_column("images", [[img_path] for img_path in 'images/']) 
+ ds = ds.add_column("images", [[img_path] for img_path in os.listdir('images/')])
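To illustrate why the row counts differed: a `for` loop over the string 'images/' walks through its characters, not the files in the folder. A small check in plain Python, with no datasets call involved:

```python
folder = "images/"

# Iterating over a string yields one character at a time.
from_string = [[img_path] for img_path in folder]

print(len(from_string))  # 7
print(from_string[:2])   # [['i'], ['m']]
```

So the "images" column had 7 rows while the messages column had 3, which is the mismatch the ValueError reports.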

marksuccsmfewercoc · April 20, 2024, 8:20am

Very confusing. Now I’m getting a FileNotFoundError: [Errno 2] No such file or directory: 'dog.png' error, but I do have a dog.png file in my images folder.
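A likely cause is that os.listdir returns bare file names such as 'dog.png', without the folder, so the path only resolves if the script runs from inside the images folder. A minimal sketch that joins the folder back on; the helper name `list_image_paths` is hypothetical, and a temporary folder stands in for images/ so the snippet runs anywhere:

```python
import os
import tempfile


def list_image_paths(folder, extensions=(".jpg", ".jpeg", ".png")):
    """Return sorted paths (folder + file name) of the image files in folder."""
    return [
        os.path.join(folder, name)
        for name in sorted(os.listdir(folder))
        if name.lower().endswith(extensions)
    ]


# Demonstration with a throwaway folder standing in for images/.
with tempfile.TemporaryDirectory() as folder:
    for name in ("dog.png", "cat.png", "notes.txt"):
        open(os.path.join(folder, name), "wb").close()

    paths = list_image_paths(folder)
    names = [os.path.basename(p) for p in paths]
    all_exist = all(os.path.exists(p) for p in paths)

print(names)      # ['cat.png', 'dog.png']
print(all_exist)  # True
```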

marksuccsmfewercoc · April 20, 2024, 2:54pm

Hey @lhoestq, I fixed that problem, but the images and the dataset rows are not in the same order. How do I fix this? How do I link each image with its conversation in the JSON data?
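One possible approach, sketched under assumptions: os.listdir makes no promise about order, so relying on folder order to match the JSON order is fragile. Storing the file name next to each conversation removes the guesswork. The record layout below (an "image" key beside "messages") and the helper `build_columns` are made up for illustration and are not the format used in the blog:

```python
import os

# Hypothetical layout for data.json: each record names its own image file.
question = [{"type": "text", "text": "What's in this image?"}, {"type": "image"}]
records = [
    {"image": "dog.png", "messages": [
        {"role": "user", "content": question},
        {"role": "assistant", "content": [{"type": "text", "text": "DOG"}]},
    ]},
    {"image": "cat.png", "messages": [
        {"role": "user", "content": question},
        {"role": "assistant", "content": [{"type": "text", "text": "CAT"}]},
    ]},
]


def build_columns(records, img_folder="images"):
    """Build parallel columns so that row i of each column comes from record i."""
    messages = [record["messages"] for record in records]
    # Each row holds a list with one image path, as in the earlier reply.
    images = [[os.path.join(img_folder, record["image"])] for record in records]
    return {"messages": messages, "images": images}


columns = build_columns(records)
print(len(columns["messages"]), len(columns["images"]))  # 2 2

# With the datasets library, these columns could then be loaded as in the
# earlier reply:
#   ds = Dataset.from_dict(columns)
#   ds = ds.cast_column("images", Sequence(Image()))
```

Because both columns are built from the same record in the same loop order, the DOG conversation always sits in the same row as dog.png.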
