So I was reading the blog Vision Language Models Explained about fine-tuning LLaVA on a dataset from the HF Hub, and I was wondering: how can I upload my own custom dataset to the Hub to do the same with my data? Can anyone help me out with that?
Currently, I have an images folder that contains all the images and a file named data.json in my root folder. What would the processing look like? I have already read some tutorials in the docs.
data.json:
[
{
"role": "user",
"content": [
{ "type": "text", "text": "Whatâs in this image?" },
{ "type": "image" }
]
},
{ "role": "assistant", "content": [{ "type": "text", "text": "DOG" }] },
{
"role": "user",
"content": [
{ "type": "text", "text": "Whatâs in this image?" },
{ "type": "image" }
]
},
{ "role": "assistant", "content": [{ "type": "text", "text": "CAT" }] },
{
"role": "user",
"content": [
{ "type": "text", "text": "Whatâs in this image?" },
{ "type": "image" }
]
},
{ "role": "assistant", "content": [{ "type": "text", "text": "DUCK" }] }
]
In the blog they use the dataset at HuggingFaceH4/llava-instruct-mix-vsft · Datasets at Hugging Face, so you should use the same format. I notice you already have the JSON part in the right format; you just need to associate it with the images.
You can do so programmatically:
from datasets import Dataset, Image, Sequence
ds = Dataset.from_dict({"messages": your_data_json_messages_list})
ds = ds.add_column("images", [[img_path] for img_path in img_paths]) # contains lists of 1 image
ds = ds.cast_column("images", Sequence(Image())) # from string type to image type
# Optionally push to the HF Hub
ds.push_to_hub("username/datasetname")
Getting this error: "ValueError: Failed to concatenate on axis=1 because tables don't have the same number of rows"
import os
import json
from datasets import Dataset, Image, Sequence
with open("data.json", "r") as f:
your_data_json = json.load(f)
img_folder = "images/"
img_paths = [os.path.join(img_folder, img_name) for img_name in os.listdir(img_folder) if img_name.endswith(('.jpg', '.jpeg', '.png'))]
ds = Dataset.from_dict({"messages": your_data_json})
ds = ds.add_column("images", [[img_path] for img_path in img_paths])
ds = ds.cast_column("images", Sequence(Image()))
ds.push_to_hub("marksuccsmfewercoc/test")
You should have as many images as items in your messages list for it to work. In particular you can make sure each item in the messages list is a list of one user message and one assistant message.
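If your data.json is a flat list of alternating user/assistant messages like the one you posted, a minimal sketch for grouping it into per-image conversations could look like this (assuming each user message is immediately followed by its assistant reply):
import json

with open("data.json", "r") as f:
    flat_messages = json.load(f)

# group the flat list into conversations of [user, assistant] pairs
conversations = [flat_messages[i:i + 2] for i in range(0, len(flat_messages), 2)]
print(len(conversations))  # should match the number of images (3 in your case)
Then pass conversations as the "messages" column instead of the flat list.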
But the LLaVA dataset and my dataset look very different. How do I fix that?
Here is what the correct dataset looks like: HuggingFaceH4/llava-instruct-mix-vsft · Datasets at Hugging Face, and here is what my dataset looks like: marksuccsmfewercoc/test · Datasets at Hugging Face
The messages column should contain lists (one list per image). For example, the first list can look like:
[
{
"content": [ { "index": null, "text": "Whatâs in this image?\n", "type": "text" }, { "index": 0, "text": null, "type": "image" } ],
"role": "user"
},
{
"content": [ { "index": null, "text": "DOG", "type": "text" } ],
"role": "assistant"
}
]
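If it helps, here is a sketch of how one such list could be built in Python from a [user, assistant] pair in your simpler data.json format (reusing the conversations list from the grouping sketch above). The index/text/null fields simply mirror the example shown here:
def to_row_format(pair):
    # pair is [user_message, assistant_message] in the data.json format above
    image_index = 0
    row = []
    for message in pair:
        content = []
        for part in message["content"]:
            if part["type"] == "image":
                content.append({"index": image_index, "text": None, "type": "image"})
                image_index += 1
            else:
                content.append({"index": None, "text": part["text"], "type": "text"})
        row.append({"content": content, "role": message["role"]})
    return row

messages_column = [to_row_format(pair) for pair in conversations]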
So what should my JSON file look like before uploading it to the Hub?
Can you please tell me the correct format to make it work for fine-tuning LLaVA according to the HF blog?
I just have three images in my images folder.
Hey @lhoestq, I'm still getting "ValueError: Failed to concatenate on axis=1 because tables don't have the same number of rows".
But when I print the lengths of the images and the JSON, both print 3, which is correct. Any idea what the issue could be?
import os
import json
from datasets import Dataset, Image, Sequence
with open("data.json", "r") as f:
your_data_json = json.load(f)
print(len(your_data_json))
print(len(os.listdir('images/')))
ds = Dataset.from_dict({"messages": your_data_json})
ds = ds.add_column("images", [[img_path] for img_path in 'images/'])
ds = ds.cast_column("images", Sequence(Image()))
ds.push_to_hub("marksuccsmfewercoc/test")
I don't understand: if both len(your_data_json) and len(os.listdir('images/')) are 3, then what's the issue?
Hi! I think there is a small bug in your code:
- ds = ds.add_column("images", [[img_path] for img_path in 'images/'])
+ ds = ds.add_column("images", [[img_path] for img_path in os.listdir('images/')])
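Note that os.listdir returns bare file names (not full paths) in arbitrary order, so you probably also want to keep the os.path.join from your earlier script and sort the names, for example (a sketch, assuming the sorted file names line up with the order of your conversations):
img_paths = sorted(
    os.path.join("images/", name)
    for name in os.listdir("images/")
    if name.endswith((".jpg", ".jpeg", ".png"))
)
ds = ds.add_column("images", [[p] for p in img_paths])  # one single-image list per row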
Very confusing. Now I'm getting a "FileNotFoundError: [Errno 2] No such file or directory: 'dog.png'" error, but I do have a dog.png file in my images folder.
Hey @lhoestq, I fixed that problem, but the images and the conversations are not in the proper order. How do I fix this? How do I link the images to each conversation in the JSON data?