Datasets.map is not consistent with IterableDataset?

Hi, I’ve run into an issue and I don’t know whether I’m misusing the library or whether there is a bug in it.

Here is the code:

from datasets import load_dataset


instruction = "Write the LaTeX representation for this image."

def convert_to_conversation(sample):
    # Wrap one sample (image + LaTeX text) into a two-turn chat conversation.
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": instruction},
                {"type": "image", "image": sample["image"]},
            ],
        },
        {
            "role": "assistant",
            "content": [
                {"type": "text", "text": sample["text"]},
            ],
        },
    ]
    return {"messages": conversation}

dataset = load_dataset(
    "unsloth/LaTeX_OCR",
    split="train",
    streaming=True,
)

item = next(iter(dataset))
item_converted = convert_to_conversation(item)
print(item_converted)

The result is the following:

{'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': 'Write the LaTeX representation for this image.'}, {'type': 'image', 'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=160x40 at 0x7FE898A06D50>}]}, {'role': 'assistant', 'content': [{'type': 'text', 'text': '{ \\frac { N } { M } } \\in { \\bf Z } , { \\frac { M } { P } } \\in { \\bf Z } , { \\frac { P } { Q } } \\in { \\bf Z }'}]}]}

Whereas with the following code:

dataset = load_dataset(
    "unsloth/LaTeX_OCR",
    split="train",
    streaming=True,
)
dataset = dataset.map(convert_to_conversation)

item = next(iter(dataset))
print(item)

Then the result shown is the following (I truncated the bytes…):

{'image': {'bytes': b'\x89PNG\r\n\.....', 'path': '5208.png'}, 'text': '{ \\frac { N } { M } } \\in { \\bf Z } , { \\frac { M } { P } } \\in { \\bf Z } , { \\frac { P } { Q } } \\in { \\bf Z }', 'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': 'Write the LaTeX representation for this image.'}, {'type': 'image', 'image': {'bytes': b'\x89PNG\r\n.....', 'path': '5208.png'}}]}, {'role': 'assistant', 'content': [{'type': 'text', 'text': '{ \\frac { N } { M } } \\in { \\bf Z } , { \\frac { M } { P } } \\in { \\bf Z } , { \\frac { P } { Q } } \\in { \\bf Z }'}]}]}

So basically, the image has been converted from a PIL Image into the encoded form of a datasets Image (a dict with bytes and path keys) without my asking for it.

Is this normal?

It seems that I can work around this conversion by using datasets.Image.decode_example, and remove the extra keys from the dict by passing remove_columns=["image", "text"] to the map method.

Did I miss something? What am I doing wrong?

Kind regards,

Tristan


I think that’s probably normal…
https://stackoverflow.com/questions/74251282/huggingface-datasets-storing-and-loading-image-data