How do I make a dataset for vision models?

So I was reading the blog post "Vision Language Models Explained" on fine-tuning LLaVA with a dataset from HF, and I was wondering: how can I upload my custom dataset to HF to do the same with my data? Can anyone help me out with that?

Currently, I have an images folder that contains all the images and a file named data.json in my root folder. What would the process look like? I have already read some tutorials in the docs.

data.json:

[
  {
    "role": "user",
    "content": [
      { "type": "text", "text": "What’s in this image?" },
      { "type": "image" }
    ]
  },
  { "role": "assistant", "content": [{ "type": "text", "text": "DOG" }] },
  {
    "role": "user",
    "content": [
      { "type": "text", "text": "What’s in this image?" },
      { "type": "image" }
    ]
  },
  { "role": "assistant", "content": [{ "type": "text", "text": "CAT" }] },
  {
    "role": "user",
    "content": [
      { "type": "text", "text": "What’s in this image?" },
      { "type": "image" }
    ]
  },
  { "role": "assistant", "content": [{ "type": "text", "text": "DUCK" }] }
]

Any idea? @lhoestq @mariosasko

In the blog they use the dataset HuggingFaceH4/llava-instruct-mix-vsft, so you should use the same format. I notice you already have the JSON part in the right format; you just need to associate the JSON part with the images.

You can do so programmatically:

from datasets import Dataset, Image, Sequence

# your_data_json_messages_list: one list of messages per image
# img_paths: one image file path per item in that list, in the same order
ds = Dataset.from_dict({"messages": your_data_json_messages_list})
ds = ds.add_column("images", [[img_path] for img_path in img_paths])  # contains lists of 1 image
ds = ds.cast_column("images", Sequence(Image()))  # from string type to image type

# Optionally push to the HF Hub
ds.push_to_hub("username/datasetname")

Getting this error “ValueError: Failed to concatenate on axis=1 because tables don’t have the same number of rows”

import os
import json
from datasets import Dataset, Image, Sequence

with open("data.json", "r") as f:
    your_data_json = json.load(f)

img_folder = "images/"
img_paths = [os.path.join(img_folder, img_name) for img_name in os.listdir(img_folder) if img_name.endswith(('.jpg', '.jpeg', '.png'))]

ds = Dataset.from_dict({"messages": your_data_json})

ds = ds.add_column("images", [[img_path] for img_path in img_paths])  

ds = ds.cast_column("images", Sequence(Image()))

ds.push_to_hub("marksuccsmfewercoc/test")

You should have as many images as items in your messages list for it to work. In particular you can make sure each item in the messages list is a list of one user message and one assistant message.
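A minimal sketch of that regrouping, assuming the flat data.json shown earlier strictly alternates user and assistant messages. The helper name `pair_messages` is hypothetical, not part of `datasets`:

```python
def pair_messages(flat_messages):
    """Group a flat, alternating user/assistant list into one conversation per image."""
    if len(flat_messages) % 2 != 0:
        raise ValueError("expected an even number of messages (user/assistant pairs)")
    conversations = []
    for i in range(0, len(flat_messages), 2):
        user, assistant = flat_messages[i], flat_messages[i + 1]
        if user["role"] != "user" or assistant["role"] != "assistant":
            raise ValueError(f"messages {i} and {i + 1} are not a user/assistant pair")
        conversations.append([user, assistant])
    return conversations


# Tiny stand-in for the contents of data.json
flat = [
    {"role": "user", "content": [{"type": "text", "text": "What's in this image?"}, {"type": "image"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "DOG"}]},
    {"role": "user", "content": [{"type": "text", "text": "What's in this image?"}, {"type": "image"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "CAT"}]},
]
conversations = pair_messages(flat)
print(len(conversations))  # 2
```

With six flat messages and three images, this gives three conversations, so the messages column and the images column end up with the same number of rows.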

But the LLaVA dataset and my dataset look very different. How do I fix it?

Here is what the correct dataset looks like: HuggingFaceH4/llava-instruct-mix-vsft, and here is what my dataset looks like: marksuccsmfewercoc/test.

The messages column should contain lists (one list per image). For example, the first list can look like this:

[
  {
    "content": [ { "index": null, "text": "What’s in this image?\n", "type": "text" }, { "index": 0, "text": null, "type": "image" } ],
    "role": "user"
  },
  {
    "content": [ { "index": null, "text": "DOG", "type": "text" } ],
    "role": "assistant" 
  }
]
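A rough sketch of how one conversation could be rewritten into that shape, assuming every content entry should carry the three keys `index`, `text` and `type` as in the example above, and assuming `index` counts the images within the conversation starting from 0. The function name `normalize_conversation` is made up for illustration:

```python
def normalize_conversation(conversation):
    """Give every content entry the keys index/text/type, as in the example above."""
    image_index = 0  # assumption: position of the image in this row's images list
    normalized = []
    for message in conversation:
        content = []
        for part in message["content"]:
            if part["type"] == "image":
                content.append({"index": image_index, "text": None, "type": "image"})
                image_index += 1
            else:
                content.append({"index": None, "text": part["text"], "type": "text"})
        normalized.append({"content": content, "role": message["role"]})
    return normalized


conversation = [
    {"role": "user", "content": [{"type": "text", "text": "What's in this image?"}, {"type": "image"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "DOG"}]},
]
normalized = normalize_conversation(conversation)
print(normalized[0]["content"][1])
```

The `None` values become `null` when the data is serialized to JSON.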

So what should my JSON file look like before uploading it to the Hub?

Can you please tell me what the correct format is to make it work with fine-tuning LLaVA according to the blog by HF?

I just have three images in my images folder.

Hey @lhoestq I’m still getting “raise ValueError(“Failed to concatenate on axis=1 because tables don’t have the same number of rows”) ValueError: Failed to concatenate on axis=1 because tables don’t have the same number of rows”.

But when I print the length of images and JSON it prints 3 and 3 which is correct. Any idea about what can be the issue?


import os
import json
from datasets import Dataset, Image, Sequence

with open("data.json", "r") as f:
    your_data_json = json.load(f)

print(len(your_data_json))
print(len(os.listdir('images/')))

ds = Dataset.from_dict({"messages": your_data_json})
ds = ds.add_column("images", [[img_path] for img_path in 'images/']) 
ds = ds.cast_column("images", Sequence(Image()))  

ds.push_to_hub("marksuccsmfewercoc/test")

I don’t understand: if both len(your_data_json) and len(os.listdir('images/')) are 3, then what’s the issue?

Hi ! I think there is a small bug in your code:

- ds = ds.add_column("images", [[img_path] for img_path in 'images/']) 
+ ds = ds.add_column("images", [[img_path] for img_path in os.listdir('images/')]) 

Very confusing. Now I’m getting a FileNotFoundError: [Errno 2] No such file or directory: 'dog.png' error, but I do have a dog.png file in my images folder.
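A likely cause: `os.listdir` returns bare file names such as dog.png, not paths, so the file is looked up relative to the current working directory instead of the images folder. A small sketch of building full paths, using a temporary folder as a stand-in for images/ (the helper name `list_image_paths` is hypothetical):

```python
import os
import tempfile


def list_image_paths(folder, extensions=(".jpg", ".jpeg", ".png")):
    """Return full paths (folder + file name) of the images in folder, sorted by name."""
    return [
        os.path.join(folder, name)
        for name in sorted(os.listdir(folder))
        if name.lower().endswith(extensions)
    ]


# Demo on a throwaway folder; in the real script, folder would be "images/"
with tempfile.TemporaryDirectory() as folder:
    for name in ("dog.png", "cat.png", "notes.txt"):
        open(os.path.join(folder, name), "w").close()
    paths = list_image_paths(folder)
    all_exist = all(os.path.exists(p) for p in paths)
    names = [os.path.basename(p) for p in paths]

print(names)  # ['cat.png', 'dog.png']
```

Note that sorting by name gives a stable order, but that order is alphabetical and does not necessarily match the order of the conversations in data.json.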

Hey @lhoestq, I fixed that problem, but the images and the conversations are not in the right order. How do I fix this? How do I link each image with its conversation in the JSON data?
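One possible way to make the link explicit, sketched under an assumption: each entry in data.json names its own image file in an extra key (here a hypothetical `"image"` key, which is not part of the format from the blog), so the order no longer depends on how the folder is listed. The helper `build_columns` is also hypothetical:

```python
import os


def build_columns(records, img_folder):
    """Build aligned messages/images columns from records that name their own image."""
    return {
        "messages": [record["messages"] for record in records],
        # one-element list per row, matching the "lists of 1 image" layout
        "images": [[os.path.join(img_folder, record["image"])] for record in records],
    }


# Stand-in for a data.json where every conversation carries its image file name
records = [
    {
        "image": "dog.png",
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": "What's in this image?"}, {"type": "image"}]},
            {"role": "assistant", "content": [{"type": "text", "text": "DOG"}]},
        ],
    },
    {
        "image": "cat.png",
        "messages": [
            {"role": "user", "content": [{"type": "text", "text": "What's in this image?"}, {"type": "image"}]},
            {"role": "assistant", "content": [{"type": "text", "text": "CAT"}]},
        ],
    },
]
columns = build_columns(records, "images")
print(columns["images"])

# The columns can then go through the same calls as in the earlier reply:
#   ds = Dataset.from_dict(columns)
#   ds = ds.cast_column("images", Sequence(Image()))
```

Because both columns are built from the same record in the same loop, row i of images always belongs to row i of messages.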