Datasets.map is not consistent with IterableDataset?

Agnellino · May 13, 2025, 3:00pm

Hi, I’ve run into an issue and I don’t know whether I’m using the library incorrectly or whether there is a problem in it.

Here is the code:

from datasets import load_dataset


instruction = "Write the LaTeX representation for this image."

def convert_to_conversation(sample):
    conversation = [
        { "role": "user",
        "content" : [
            {"type" : "text",  "text"  : instruction},
            {"type" : "image", "image" : sample["image"]} ]
        },
        { "role" : "assistant",
        "content" : [
            {"type" : "text",  "text"  : sample["text"]} ]
        },
    ]
    return { "messages" : conversation }

dataset = load_dataset(
    "unsloth/LaTeX_OCR",
    split="train",
    streaming=True,
)

item = next(iter(dataset))
item_converted = convert_to_conversation(item)
print(item_converted)

The result is the following:

{'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': 'Write the LaTeX representation for this image.'}, {'type': 'image', 'image': <PIL.PngImagePlugin.PngImageFile image mode=RGB size=160x40 at 0x7FE898A06D50>}]}, {'role': 'assistant', 'content': [{'type': 'text', 'text': '{ \\frac { N } { M } } \\in { \\bf Z } , { \\frac { M } { P } } \\in { \\bf Z } , { \\frac { P } { Q } } \\in { \\bf Z }'}]}]}

Whereas with the following code:

dataset = load_dataset(
    "unsloth/LaTeX_OCR",
    split="train",
    streaming=True,
)
dataset = dataset.map(convert_to_conversation)

item = next(iter(dataset))
print(item)

Then the result shown is the following (I truncated the bytes…):

{'image': {'bytes': b'\x89PNG\r\n\.....', 'path': '5208.png'}, 'text': '{ \\frac { N } { M } } \\in { \\bf Z } , { \\frac { M } { P } } \\in { \\bf Z } , { \\frac { P } { Q } } \\in { \\bf Z }', 'messages': [{'role': 'user', 'content': [{'type': 'text', 'text': 'Write the LaTeX representation for this image.'}, {'type': 'image', 'image': {'bytes': b'\x89PNG\r\n.....', 'path': '5208.png'}}]}, {'role': 'assistant', 'content': [{'type': 'text', 'text': '{ \\frac { N } { M } } \\in { \\bf Z } , { \\frac { M } { P } } \\in { \\bf Z } , { \\frac { P } { Q } } \\in { \\bf Z }'}]}]}

So basically, the PIL image has been converted into the encoded form used by datasets.Image (a dict with 'bytes' and 'path' keys) without my asking for it.

Is this normal?

It seems that I can work around the type conversion by using datasets.Image.decode_example, and drop the extra keys from the dict by passing remove_columns=["image", "text"] to the map method.

Did I miss something? What am I doing wrong?

Kind regards,

Tristan

John6666 · May 14, 2025, 7:41am

I think that’s probably normal…
https://stackoverflow.com/questions/74251282/huggingface-datasets-storing-and-loading-image-data
