Trouble converting CLIP to ONNX

Hello everyone! I would like to convert the CLIP model to ONNX format. I read the documentation on how to do it, and this is what I ended up with:

My libs:
torch - 1.12.1
transformers - 4.23.1
onnxruntime - 1.11.1
onnx-simplifier - 0.4.8

Code:

import time
from PIL import Image
import torch
import onnx
import onnxruntime as ort
from onnxsim import simplify
import transformers
import transformers.onnx
from transformers import CLIPModel, CLIPProcessor
import requests

import warnings
warnings.filterwarnings('ignore')


# Load the processor from the Hub; the model weights are stored locally.
pt_model = CLIPModel.from_pretrained('./models/clip_model/category_clip_model', local_files_only=True)
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

# Save to disk
processor.save_pretrained("local-pt-checkpoint")
pt_model.save_pretrained("local-pt-checkpoint")

# Then convert to ONNX with transformers.onnx (run from a notebook cell):
!python -m transformers.onnx --model=local-pt-checkpoint onnx/

session = ort.InferenceSession("onnx/model.onnx")
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
inputs = processor(text=["a photo of a cat"], images=image, return_tensors="np", padding=True)
outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
# At this point I get an error (screenshot 2)

Also, after converting (with transformers.onnx) I got that kind of output. It seems strange because the exported model has 3 inputs instead of two (image and text).
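For reference, a quick way to check what the export actually produced is to ask the onnxruntime session for its input and output names (a minimal sketch, assuming onnx/model.onnx is the file written by the command above):

import onnxruntime as ort

session = ort.InferenceSession("onnx/model.onnx")

# For a CLIP export via transformers.onnx the inputs are usually
# input_ids, pixel_values and attention_mask, i.e. text, image and text mask,
# which would explain seeing three inputs instead of two.
for i in session.get_inputs():
    print("input:", i.name, i.shape, i.type)

# The outputs are usually logits_per_image, logits_per_text, text_embeds and
# image_embeds; "last_hidden_state" is not one of them, which is likely why
# the session.run call above fails.
for o in session.get_outputs():
    print("output:", o.name, o.shape, o.type)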

Could you include the command you used to convert to ONNX?

You can try:

inputs = processor(text=["a photo of two cats"], images=image, return_tensors="np", padding=True)
session.run(input_feed=dict(inputs), output_names=["logits_per_image"])

which returns logits_per_image for each of the text inputs
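If it helps, turning those logits into per-text probabilities is just a softmax over the text axis. A minimal sketch, assuming the session and inputs from the snippets above:

import numpy as np

# logits_per_image has shape (num_images, num_texts); a softmax over the last
# axis gives the probability of each text prompt for each image.
logits_per_image = session.run(["logits_per_image"], dict(inputs))[0]
exp = np.exp(logits_per_image - logits_per_image.max(axis=-1, keepdims=True))
probs = exp / exp.sum(axis=-1, keepdims=True)
print(probs)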


Are you looking for this:

!python -m transformers.onnx --model=local-pt-checkpoint onnx/
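For completeness, the same export can also be done from Python rather than the CLI. A rough sketch, assuming this transformers version accepts the CLIPProcessor as the preprocessor argument of transformers.onnx.export:

from pathlib import Path
from transformers import CLIPModel, CLIPProcessor
from transformers.onnx import export, FeaturesManager

pt_model = CLIPModel.from_pretrained("local-pt-checkpoint")
processor = CLIPProcessor.from_pretrained("local-pt-checkpoint")

# Look up the ONNX config registered for this architecture/feature.
model_kind, config_ctor = FeaturesManager.check_supported_model_or_raise(pt_model, feature="default")
onnx_config = config_ctor(pt_model.config)

# Export with the config's default opset; writes onnx/model.onnx.
Path("onnx").mkdir(exist_ok=True)
onnx_inputs, onnx_outputs = export(
    processor,
    pt_model,
    onnx_config,
    onnx_config.default_onnx_opset,
    Path("onnx/model.onnx"),
)
print("inputs:", onnx_inputs)
print("outputs:", onnx_outputs)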