Hi, I have run into an issue and I don't know whether I am misusing the library or whether there is a problem in it.
Here is the code:
from datasets import load_dataset

instruction = "Write the LaTeX representation for this image."

def convert_to_conversation(sample):
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image", "image": sample["image"]},
            ],
        },
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": sample["text"]},
            ],
        },
    ]
    return {"messages": conversation}

dataset = load_dataset(
    "unsloth/LaTeX_OCR",
    split="train",
    streaming=True,
)
item = next(iter(dataset))
item_converted = convert_to_conversation(item)
print(item_converted)
The result is the following:
{'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': 'Write the LaTeX representation for this image.'}, {'type': 'image', 'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=160x40 at 0x7FE898A06D50>}]}, {'role': 'assistant', 'content': [{'type': 'text', 'text': '{ \\frac { N } { M } } \\in { \\bf Z } , { \\frac { M } { P } } \\in { \\bf Z } , { \\frac { P } { Q } } \\in { \\bf Z }'}]}]}
Whereas with the following code:
dataset = load_dataset(
    "unsloth/LaTeX_OCR",
    split="train",
    streaming=True,
)
dataset = dataset.map(convert_to_conversation)
item = next(iter(dataset))
print(item)
Then the result shown is the following (I truncated the bytes…):
{'image': {'bytes': b'\x89PNG\r\n\.....', 'path': '5208.png'}, 'text': '{ \\frac { N } { M } } \\in { \\bf Z } , { \\frac { M } { P } } \\in { \\bf Z } , { \\frac { P } { Q } } \\in { \\bf Z }', 'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': 'Write the LaTeX representation for this image.'}, {'type': 'image', 'image': {'bytes': b'\x89PNG\r\n.....', 'path': '5208.png'}}]}, {'role': 'assistant', 'content': [{'type': 'text', 'text': '{ \\frac { N } { M } } \\in { \\bf Z } , { \\frac { M } { P } } \\in { \\bf Z } , { \\frac { P } { Q } } \\in { \\bf Z }'}]}]}
So basically, the image inside "messages" has been converted from a PIL Image into the encoded form of a datasets Image (a dict with 'bytes' and 'path' keys) without my asking for it.
Is this normal?
It seems that I can work around the type casting by using datasets.Image.decode_example, and remove the additional keys from the dict by passing remove_columns=["image", "text"] to the map method.
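To make the workaround concrete, here is a minimal sketch of decoding the encoded image dicts back into PIL images after map. The helper names (looks_like_encoded_image, decode_with_pil, decode_images) are hypothetical and not part of datasets; the sketch assumes each encoded image is a dict with exactly the keys 'bytes' and 'path', as in the output above, and that 'bytes' is populated. It uses Pillow directly instead of datasets.Image.decode_example, and takes the decoder as a parameter so either one can be plugged in.

```python
import io


def looks_like_encoded_image(value):
    # Assumption: an encoded image is a dict with exactly these two keys,
    # matching the {'bytes': ..., 'path': ...} dicts in the printed output.
    return isinstance(value, dict) and set(value) == {"bytes", "path"}


def decode_with_pil(value):
    # Requires Pillow; imported lazily so the helpers can be used without it.
    # Assumes value["bytes"] is not None.
    from PIL import Image
    return Image.open(io.BytesIO(value["bytes"]))


def decode_images(obj, decoder=decode_with_pil):
    # Walk nested dicts and lists, replacing every encoded image dict
    # with the result of decoder(...). Everything else is returned as-is.
    if looks_like_encoded_image(obj):
        return decoder(obj)
    if isinstance(obj, dict):
        return {key: decode_images(val, decoder) for key, val in obj.items()}
    if isinstance(obj, list):
        return [decode_images(val, decoder) for val in obj]
    return obj


# Possible usage with the dataset from this post (not executed here):
# dataset = dataset.map(convert_to_conversation, remove_columns=["image", "text"])
# item = next(iter(dataset))
# item["messages"] = decode_images(item["messages"])
```

Passing decoder=datasets.Image().decode_example should be an alternative to the Pillow-based default, since that method also takes the 'bytes'/'path' dict, but I have not verified that it behaves identically in every case.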
Did I miss something? What am I doing wrong?
Kind regards,
Tristan