The problem is that when is_vqa is set to true, the processor automatically uses the Pix2Struct image processor, which requires an image input.
Normally, when the processor input doesn't include an image, it automatically falls back from the image processor to a plain text tokenizer. However, when is_vqa is set to true, this automatic fallback doesn't happen: the processor sticks with the image processor even if your input is text only. A simple solution is therefore to call the tokenizer directly via processor.tokenizer whenever you want to tokenize text-only inputs, which here are the ground-truth texts for the model's output:

text_inputs = processor.tokenizer(text=texts, padding="max_length", truncation=True, return_tensors="pt", add_special_tokens=True, max_length=20)
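To see the difference in isolation, here is a minimal sketch (the input string is just a placeholder; google/deplot ships with is_vqa set to true):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("google/deplot")

# With is_vqa=True the processor does not fall back to the plain tokenizer,
# so a text-only call like this one fails because it still routes the
# input through the image processor:
# processor(text=["some ground truth text"], return_tensors="pt")

# Calling the underlying tokenizer directly works:
text_inputs = processor.tokenizer(
    text=["some ground truth text"],
    padding="max_length",
    truncation=True,
    max_length=20,
    add_special_tokens=True,
    return_tensors="pt",
)
print(text_inputs.input_ids.shape)  # torch.Size([1, 20])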
I made some changes to the dataset and collator as follows, and this works correctly:
from PIL import Image
from torch.utils.data import Dataset
from transformers import AutoProcessor, Pix2StructForConditionalGeneration

class ChartParametersDataset(Dataset):
    def __init__(self, data_root) -> None:
        ...

    def __len__(self):
        return len(...)

    def __getitem__(self, idx):
        your_code_here
        img = Image.open('yourimagehere.jpg').convert("RGB")
        prompt = "Generate underlying data table of the figure below:"
        text = "your ground truth output text"
        inputs = {
            "text": text,      # ground-truth output text
            "prompt": prompt,  # rendered onto the image by the processor
            "image": img,
        }
        return inputs
processor = AutoProcessor.from_pretrained("google/deplot")
model = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")
def collator(batch):
    texts = [item["text"] for item in batch]
    images = [item["image"] for item in batch]
    prompts = [item["prompt"] for item in batch]

    # Tokenize the ground-truth texts directly with the tokenizer,
    # bypassing the image processor (see above).
    text_inputs = processor.tokenizer(text=texts, padding="max_length", truncation=True,
                                      return_tensors="pt", add_special_tokens=True,
                                      max_length=20)

    # The processor renders the prompt onto the image and returns the
    # flattened patches together with their attention mask.
    encoding = processor(images=images, text=prompts,
                         return_tensors="pt", add_special_tokens=True,
                         max_patches=1024)
    # print(encoding)  # uncomment to inspect the processor output

    new_batch = {
        "labels": text_inputs.input_ids,
        "flattened_patches": encoding["flattened_patches"],
        "attention_mask": encoding["attention_mask"],
    }
    return new_batch
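For completeness, here is a minimal training-loop sketch showing how the dataset and collator fit together; the data path, batch size, and optimizer hyperparameters are illustrative assumptions, not values from the original setup:

import torch
from torch.utils.data import DataLoader

dataset = ChartParametersDataset("path/to/your/data")  # placeholder path
loader = DataLoader(dataset, batch_size=2, shuffle=True, collate_fn=collator)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # assumed hyperparameters
model.train()

for batch in loader:
    # Pix2StructForConditionalGeneration computes the loss internally
    # when labels are provided.
    outputs = model(flattened_patches=batch["flattened_patches"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["labels"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()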