Hello everyone! I would like to convert the CLIP model to ONNX format. I followed the documentation on how to do it, and this is what I came up with:
My libs:
torch - 1.12.1
transformers - 4.23.1
onnxruntime - 1.11.1
onnx-simplifier - 0.4.8
Code:
import time
from PIL import Image
import torch
import onnx
import onnxruntime as ort
from onnxsim import simplify
import transformers
import transformers.onnx
from transformers import CLIPModel, CLIPProcessor
import requests
import warnings
warnings.filterwarnings('ignore')
# Load the processor from the hub; the model weights are stored locally.
pt_model = CLIPModel.from_pretrained('./models/clip_model/category_clip_model', local_files_only=True)
processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')
# Save to disk
processor.save_pretrained("local-pt-checkpoint")
pt_model.save_pretrained("local-pt-checkpoint")
# then convert to ONNX with transformers.onnx
!python -m transformers.onnx --model=local-pt-checkpoint onnx/
session = ort.InferenceSession("onnx/model.onnx")
image = Image.open(requests.get("http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
inputs = processor(text=["a photo of a cat"], images=image, return_tensors="np", padding=True)
outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
# at this point I get an error (screenshot 2)
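A small diagnostic I can run with the session created above, to see which output names the exported graph actually declares (I suspect "last_hidden_state" simply isn't one of them, but I'm not sure):

# List the output names the exported model actually declares,
# since "last_hidden_state" may not be among them.
print("outputs:", [o.name for o in session.get_outputs()])
# Passing None instead of explicit output_names should return every declared output.
# outputs = session.run(None, dict(inputs))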
I also got this kind of output after converting with transformers.onnx. It seems strange to me, because the exported model has three inputs instead of the two I expected (image and text).
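To double-check where the three inputs come from, I can load the exported file with onnx and print the declared inputs; my guess is that they are input_ids, attention_mask, and pixel_values (the same keys the processor returns), but I'm not sure:

import onnx
# Print every input declared by the exported graph, with its (possibly symbolic) shape.
onnx_model = onnx.load("onnx/model.onnx")
for inp in onnx_model.graph.input:
    dims = [d.dim_param or d.dim_value for d in inp.type.tensor_type.shape.dim]
    print(inp.name, dims)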