ValueError: Invalid image type. Expected either PIL.Image.Image, numpy.ndarray, torch.Tensor, tf.Tensor or jax.ndarray, but got

I am trying to train the DePlot model using the Hugging Face library, but I am facing a ValueError.

I followed the pix2struct fine-tuning notebook suggested in the DePlot code:
https://github.com/huggingface/notebooks/blob/main/examples/image_captioning_pix2struct.ipynb

I have successfully trained pix2struct-base models using the same data format (image and text) without encountering any issues. However, I face this particular problem when training the DePlot model.

My Dataset format:

Dataset({
    features: ['image', 'text'],
    num_rows: 14289
})

The check below prints True, so the images are PIL images as expected:

print(isinstance(dataset[0]["image"], Image.Image))

Code I used:

from torch.utils.data import Dataset, DataLoader

MAX_PATCHES = 1024

class ImageCaptioningDataset(Dataset):
    def __init__(self, dataset, processor):
        self.dataset = dataset
        self.processor = processor

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        encoding = self.processor(images=item["image"], text="Generate underlying data table of the figure below:",
                                  return_tensors="pt", add_special_tokens=True, max_patches=MAX_PATCHES)
        encoding = {k: v.squeeze() for k, v in encoding.items()}
        encoding["text"] = item["text"]
        return encoding

from transformers import AutoProcessor, Pix2StructForConditionalGeneration

processor = AutoProcessor.from_pretrained("google/deplot")
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

import torch

def collator(batch):
    new_batch = {"flattened_patches": [], "attention_mask": []}
    texts = [item["text"] for item in batch]

    text_inputs = processor(text=texts, padding="max_length", truncation=True,
                            return_tensors="pt", add_special_tokens=True, max_length=512)

    new_batch["labels"] = text_inputs.input_ids

    for item in batch:
        new_batch["flattened_patches"].append(item["flattened_patches"])
        new_batch["attention_mask"].append(item["attention_mask"])

    new_batch["flattened_patches"] = torch.stack(new_batch["flattened_patches"])
    new_batch["attention_mask"] = torch.stack(new_batch["attention_mask"])

    return new_batch

train_dataset = ImageCaptioningDataset(dataset, processor)
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=2, collate_fn=collator)

import torch
from transformers.optimization import Adafactor, get_cosine_schedule_with_warmup
import os

EPOCHS = 5000

optimizer = Adafactor(model.parameters(), scale_parameter=False, relative_step=False, lr=0.01, weight_decay=1e-05)
scheduler = get_cosine_schedule_with_warmup(optimizer, num_warmup_steps=1000, num_training_steps=40000)

device = torch.device("cuda:2") if torch.cuda.is_available() else "cpu"
model.to(device)

model.train()

for epoch in range(EPOCHS):
    print("Epoch:", epoch)

    for idx, batch in enumerate(train_dataloader):
        labels = batch.pop("labels").to(device)
        flattened_patches = batch.pop("flattened_patches").to(device)
        attention_mask = batch.pop("attention_mask").to(device)

        outputs = model(flattened_patches=flattened_patches,
                        attention_mask=attention_mask,
                        labels=labels)

        loss = outputs.loss

        print("Loss:", loss.item())

        loss.backward()

        optimizer.step()
        optimizer.zero_grad()
        scheduler.step()  # update the learning rate scheduler

        if (epoch + 1) % 20 == 0:
            model.eval()

            predictions = model.generate(flattened_patches=flattened_patches, attention_mask=attention_mask)
            print("Predictions:", processor.batch_decode(predictions, skip_special_tokens=True))

            model.train()

Error I am facing:


ValueError                                Traceback (most recent call last)
Cell In[20], line 18
     15 for epoch in range(EPOCHS):
     16     print("Epoch:", epoch)
---> 18 for idx, batch in enumerate(train_dataloader):
     19     labels = batch.pop("labels").to(device)
     20     flattened_patches = batch.pop("flattened_patches").to(device)

File ~/anaconda3/envs/deplot_3/lib/python3.9/site-packages/torch/utils/data/dataloader.py:652, in _BaseDataLoaderIter.__next__(self)
    649 if self._sampler_iter is None:
    650     # TODO(https://github.com/pytorch/pytorch/issues/76750)
    651     self._reset()  # type: ignore[call-arg]
--> 652 data = self._next_data()
    653 self._num_yielded += 1
    654 if self._dataset_kind == _DatasetKind.Iterable and \
    655         self._IterableDataset_len_called is not None and \
    656         self._num_yielded > self._IterableDataset_len_called:

File ~/anaconda3/envs/deplot_3/lib/python3.9/site-packages/torch/utils/data/dataloader.py:692, in _SingleProcessDataLoaderIter._next_data(self)
    690 def _next_data(self):
    691     index = self._next_index()  # may raise StopIteration
--> 692 data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    693 if self._pin_memory:
    694     data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)

…

    133     "Invalid image type. Expected either PIL.Image.Image, numpy.ndarray, torch.Tensor, tf.Tensor or "
    134     f"jax.ndarray, but got {type(images)}."
    135 )

ValueError: Invalid image type. Expected either PIL.Image.Image, numpy.ndarray, torch.Tensor, tf.Tensor or jax.ndarray, but got <class 'NoneType'>.

Same question here! My guess is that since the new DePlot processor aggregates both the tokenizer and the Pix2Struct image processor, it requires the images= parameter, as used in the __getitem__ method of the Dataset class, but I have no idea what the images should be in the collator function.

This pointed me towards a solution.

When running from the notebook, images and text are processed separately: images in ImageCaptioningDataset.__getitem__ and text in collator. Therefore the processor does not get an image input when processing the text. However, since processor.image_processor.is_vqa is True, it expects one.

Inside the processor, this input should be treated as text-only, but the conditional evaluates to False because of is_vqa, so self.image_processor(images, ...) gets called with images=None, which produces the error message.
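
Roughly, the relevant logic in Pix2StructProcessor.__call__ looks like this (a paraphrased sketch, not the exact transformers source):

    def __call__(self, images=None, text=None, **kwargs):
        # Text-only path: only taken when is_vqa is False.
        if images is None and not self.image_processor.is_vqa:
            return self.tokenizer(text=text, **kwargs)

        # VQA path: the image processor renders `text` as a header onto the
        # image, so it is called even when images is None, raising
        # "Invalid image type ... but got <class 'NoneType'>".
        encoding = self.image_processor(images, header_text=text, **kwargs)
        ...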

You can do a quick fix by setting processor.image_processor.is_vqa = False before iterating the dataloader.
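
For example, with the code from the question, just flip the flag before building the dataloader:

    # Quick fix: let text-only processor calls fall back to the tokenizer.
    processor.image_processor.is_vqa = False

    train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=2, collate_fn=collator)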

Thank you, it worked :+1:

But from my understanding, DePlot preprocessing renders the input prompt ("generate the underlying data ...") as a header on top of the input image. If we set processor.image_processor.is_vqa = False, the image will not be preprocessed correctly, since render_header will never be called, and the model will see incorrectly processed inputs.

So I don't think this is a proper solution. Still exploring the correct one; I will post it later.
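
One way to check this (a sketch; chart.png is a hypothetical local image):

    import torch
    from PIL import Image

    img = Image.open("chart.png").convert("RGB")  # hypothetical example image
    prompt = "Generate underlying data table of the figure below:"

    # With is_vqa=True (the default for google/deplot) the prompt is rendered
    # onto the image as a header before it is split into patches.
    processor.image_processor.is_vqa = True
    with_header = processor(images=img, text=prompt, return_tensors="pt", max_patches=1024)

    # With is_vqa=False the image is patched as-is, without the rendered header.
    processor.image_processor.is_vqa = False
    without_header = processor(images=img, return_tensors="pt", max_patches=1024)
    processor.image_processor.is_vqa = True  # restore the default

    # The patch tensors differ because render_header changed the pixels.
    print(torch.equal(with_header["flattened_patches"], without_header["flattened_patches"]))  # expect: False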

The problem is caused when is_vqa is set to True: the processor then routes everything through the Pix2Struct image processor, which requires an image input.

Normally, when the processor input doesn't include an image, it automatically falls back from the image processor to a plain text tokenizer. However, when is_vqa is set to True in the Pix2Struct auto-processor, this automatic handling doesn't work: the processor sticks to the image processor even if your input is text-only. A simple solution is therefore to call the tokenizer directly via processor.tokenizer when you want to tokenize the text inputs, which are actually the ground-truth texts for the model's output:

    text_inputs = processor.tokenizer(text=texts, padding="max_length", return_tensors="pt", add_special_tokens=True, max_length=20)

I made some changes to the dataset and collator, as follows, and it works correctly:

from torch.utils.data import Dataset, DataLoader
from PIL import Image

class ChartParametersDataset(Dataset):
    def __init__(self, data_root) -> None:
        ...

    def __len__(self):
        return len(...)

    def __getitem__(self, idx):
        your_code_here
        img = Image.open('yourimagehere.jpg').convert("RGB")
        prompt = "Generate underlying data table of the figure below:"
        text = "your ground truth output text"

        inputs = {
            "text": text,
            "prompt": prompt,
            "image": img,
        }
        return inputs

from transformers import AutoProcessor, Pix2StructForConditionalGeneration

processor = AutoProcessor.from_pretrained("google/deplot")
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")

def collator(batch):
    new_batch = {"flattened_patches": [], "attention_mask": []}
    texts = [item["text"] for item in batch]
    images = [item["image"] for item in batch]
    prompts = [item["prompt"] for item in batch]

    # Tokenize the ground-truth outputs directly, bypassing the image processor.
    text_inputs = processor.tokenizer(text=texts, padding="max_length", return_tensors="pt",
                                      add_special_tokens=True, max_length=20)

    new_batch["labels"] = text_inputs.input_ids

    # Process the images together with the prompt so render_header is applied.
    encoding = processor(images=images, text=prompts,
                         return_tensors="pt", add_special_tokens=True,
                         max_patches=1024)

    print(encoding)  # debug: inspect the processed batch

    new_batch["flattened_patches"] = encoding["flattened_patches"]
    new_batch["attention_mask"] = encoding["attention_mask"]

    return new_batch
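
Wired up the same way as before (data_root here is a hypothetical path):

    train_dataset = ChartParametersDataset(data_root="path/to/your/data")  # hypothetical path
    train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=2, collate_fn=collator)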
