Converting CLIP to CoreML

Hi, I’d like to convert the CLIP model to CoreML, but I’m running into some errors. Has anyone managed to do this? If someone could help point me in the right direction, that would be great.

I thought of first converting the vision model. Here’s my code so far:

import coremltools as ct
import torch
from transformers import CLIPProcessor, CLIPModel

model_version = "openai/clip-vit-base-patch32"  # placeholder: use whichever CLIP checkpoint you need
model = CLIPModel.from_pretrained(model_version)
visual_model = model.vision_model
visual_model.eval()


# Trace the model with random data.
example_input_image = torch.rand(1, 3, 224, 224)

traced_model = torch.jit.trace(visual_model, example_input_image)
out = traced_model(example_input_image)

There are a couple of issues:

  1. RuntimeError: Encountering a dict at the output of the tracer might cause the trace to be incorrect, this is only valid if the container structure does not change based on the module’s inputs. Consider using a constant container instead (e.g. for list, use a tuple instead. for dict, use a NamedTuple instead). If you absolutely need this and know the side effects, pass strict=False to trace() to allow this behavior.

  2. Warning: site-packages/transformers/models/clip/modeling_clip.py:222: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can’t record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs! if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):

  3. Warning: /site-packages/transformers/models/clip/modeling_clip.py:262: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can’t record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs! if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):

I’ll tag @pcuenq here

Thanks @radames! And hello, @alkibijad :slight_smile:

Regarding the first issue, you might avoid getting a dict in the output by passing return_dict=False to the invocation. I’m not sure this will be enough, as I think the model then returns a tuple instead; see the documentation here.
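For instance, something along these lines might work for the vision tower (a rough, untested sketch; it assumes the submodule exposes its config and that setting return_dict there is enough to switch the output to a tuple):

import torch
from transformers import CLIPModel

model_version = "openai/clip-vit-base-patch32"  # example checkpoint; substitute your own
model = CLIPModel.from_pretrained(model_version)
visual_model = model.vision_model.eval()

# Ask the vision tower to return a plain tuple instead of a ModelOutput dict,
# so the tracer no longer complains about dict outputs.
visual_model.config.return_dict = False

example_input_image = torch.rand(1, 3, 224, 224)
traced_model = torch.jit.trace(visual_model, example_input_image)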

The other warnings are frequent when tracing or compiling models. They are usually fine, as long as you use the model (in production) with inputs of the same shape as the one you used for tracing. The warning means that if you use a different shape, then the code path might be different and the traced code may no longer be correct.

I hope these are enough to get you started, please let us know how it goes!

Thanks for the initial guidance @pcuenq & @radames! I definitely needed to read the docs a bit more, so I went through both CLIP’s docs and the CoreML docs.

I’ve managed to:

  • Trace the model (the two warnings are still present, but they shouldn’t have any impact)
  • Convert the model to CoreML
  • Run inference on the CoreML model

The standard and traced models produce exactly the same output, as they should, but the converted CoreML model produces slightly different output.

The conversion to CoreML yields no errors or warnings:

Converting PyTorch Frontend ==> MIL Ops: 100%|█████████▉| 696/697 [00:00<00:00, 4630.25 ops/s]
Running MIL Common passes: 100%|██████████| 40/40 [00:00<00:00, 96.36 passes/s]
Running MIL Clean up passes: 100%|██████████| 11/11 [00:00<00:00, 56.34 passes/s]
Translating MIL ==> NeuralNetwork Ops: 100%|██████████| 838/838 [00:54<00:00, 15.37 ops/s] 

When I pass the exact same image through the models, the CoreML model gives different results. Here’s how my code looks (simplified a bit):

import coremltools as ct
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

# Load CLIP
processor = CLIPProcessor.from_pretrained(model_version)
model_pt = CLIPModel.from_pretrained(model_version)
model_pt.eval()

# Trace
# wrapped_model -> wrapped CLIPModel so that forward() function returns get_image_features()
example_input = torch.rand(1, 3, 224, 224)
model_traced = torch.jit.trace(model_pt, example_input)

# Convert traced model to CoreML
model_coreml = ct.convert(
    model_traced,
    inputs=[ct.TensorType(name="input_image", shape=example_input.shape)]
)

# Inference through all 3 models. Convert to numpy for easier comparison
image = ... # Load real image from path
processed_image = processor(text=None, images=[image], return_tensors="pt", padding=True)
pixel_values = processed_image["pixel_values"]

with torch.no_grad():
    res_pt = model_pt.get_image_features(pixel_values).numpy()
    res_traced = model_traced(pixel_values).numpy()
# 'output_name' is a placeholder for the converted model's actual output name
res_coreml = model_coreml.predict({'input_image': pixel_values.numpy()})['output_name']


# Compare outputs
print(np.array_equal(res_pt, res_traced)) # True -> standard and traced model produce the same results
print(np.array_equal(res_pt, res_coreml)) # False -> different output

# How close are the outputs?
print(np.allclose(res_pt, res_coreml, atol=1e-5)) # False
print(np.allclose(res_pt, res_coreml, atol=1e-2)) # False
print(np.allclose(res_pt, res_coreml, atol=1e-1)) # True -> Results differ ~0.1

Does anyone know what could cause this difference in output? The weights are stored as float32, so there’s no 32 → 16 bit conversion.

Hi @alkibijad! Glad to know that you are making progress :slight_smile:

Regarding your observation about the outputs, you are right: it is normal that the results are not numerically identical to the ones from PyTorch. Even if the weights are stored in 32-bit format, it doesn’t mean that the model has been converted using 32-bit precision, or runs with it.

The default conversion procedure is not guaranteed to happen in float32, as described in this documentation. Furthermore, execution precision depends on the hardware you run your model on. Another factor that adds some confusion is the conversion format (the legacy “Neural Network” format vs. the newer “ML Program”). If you convert to Neural Network, then inference precision will be 32-bit when the model (or portions of it) runs on CPU, but 16-bit on GPU and Neural Engine. Using ML Program, you can also run 32-bit precision on GPU (but not on the Neural Engine).

You can force the conversion to happen in 32-bit mode. In ML Program mode, you can even convert most operations using 16-bit precision while preserving some of them in 32-bit mode.
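For reference, the conversion call could look roughly like this (a sketch that reuses the model_traced object from the snippet above; untested):

import coremltools as ct

# ML Program format with float32 compute precision throughout.
model_fp32 = ct.convert(
    model_traced,
    inputs=[ct.TensorType(name="input_image", shape=(1, 3, 224, 224))],
    convert_to="mlprogram",
    compute_precision=ct.precision.FLOAT32,
)

# Or keep the ML Program default (float16) and compare it against the fp32 version.
model_fp16 = ct.convert(
    model_traced,
    inputs=[ct.TensorType(name="input_image", shape=(1, 3, 224, 224))],
    convert_to="mlprogram",
)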

To verify the correctness of the conversion, you can first run the result on CPU, selecting the appropriate compute units, and measure the error with respect to the outputs from the original model, but without expecting numerical equivalence. This article suggests a signal-to-noise metric to measure the difference. Then you can run predictions incorporating GPU and/or NE, and measure the quality again. It’s usually ok to use conversion defaults and run in 16-bit mode on all eligible devices, but depending on your project (and the model) you might need to override some settings or resort to more exotic things such as per-op precision specification using typed tensors.
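As a concrete illustration of that verification step (again just a sketch: it reuses model_traced, pixel_values and res_pt from the snippets above, 'output_name' is still a placeholder, and snr_db is simply the usual 10·log10 signal-to-noise formula):

import numpy as np
import coremltools as ct

def snr_db(reference, test):
    # Signal-to-noise ratio in dB: higher means the converted output is closer to the reference.
    reference = np.asarray(reference, dtype=np.float64)
    noise = reference - np.asarray(test, dtype=np.float64)
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-12))

# First run the converted model on CPU only, where precision is highest.
model_cpu = ct.convert(
    model_traced,
    inputs=[ct.TensorType(name="input_image", shape=(1, 3, 224, 224))],
    compute_units=ct.ComputeUnit.CPU_ONLY,
)
res_cpu = model_cpu.predict({"input_image": pixel_values.numpy()})["output_name"]
print(snr_db(res_pt, res_cpu))

# Then reconvert (or reload) with ComputeUnit.ALL to include GPU / Neural Engine and compare again.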

Let us know how it goes :slight_smile:

Thank you @pcuenq, it’s nice of you to detail the answer that much! There’s a lot of useful information in it.

The difference in the output has no impact on my use case, so I can say that the CLIP model can be successfully converted to CoreML :raised_hands:

@alkibijad Thanks for posting your findings while you went through the conversion process!

Would you be able to post a snippet of the finished code that you used to do the conversion to CoreML? I’m attempting to do the same and have run into a few issues.

For example, if I try converting the model using the code below, I get the following error:

import coremltools as ct
import torch
from transformers import CLIPModel, CLIPProcessor

# Load CLIP
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
model_pt = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
model_pt.eval()

# Trace
example_input = torch.rand(1, 3, 224, 224)
model_traced = torch.jit.trace(model_pt, example_input)

# Convert traced model to CoreML
model_coreml = ct.convert(
    model_traced,
    inputs=[ct.TensorType(name="input_image", shape=example_input.shape)]
)

Error:

ValueError: You have to specify pixel_values

Made some progress. If I run the code below, the first conversion step completes:

Converting PyTorch Frontend ==> MIL Ops: 100%|█████████▉| 1379/1380 [00:01<00:00, 1155.22 ops/s]

import coremltools as ct
import torch
from transformers import CLIPModel

# Load CLIP
model_pt = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
visual_model = model_pt.vision_model
visual_model.eval()

example_input_image = torch.rand(1, 3, 224, 224)
model_traced = torch.jit.trace(visual_model, example_input_image, strict=False)

# Convert the traced model to CoreML
coreml_model = ct.convert(
    model_traced,
    inputs=[ct.TensorType(name="input_image", shape=example_input_image.shape)]
)

Unfortunately, I get the following error:

RuntimeError: PyTorch convert function for op ‘dictconstruct’ not implemented.

I have this same error. Did you figure out how to fix it?

Sorry for the late reply. I had issues when converting the vision model on its own. The hack/solution was to trace the entire CLIP model with only the image as input.

Also, if you don’t need the output of the full CLIP model (which is a dictionary/tuple, I think), you can wrap CLIP in your own model whose forward pass calls clip_model.get_image_features(x). That way you can be sure there are no dictionaries that could cause conversion issues.
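In case it helps, here’s a rough sketch of such a wrapper (the class name is just illustrative, not from my actual code):

import torch
from transformers import CLIPModel

class CLIPImageEncoder(torch.nn.Module):
    """Thin wrapper whose forward() returns only the image feature tensor."""

    def __init__(self, clip_model):
        super().__init__()
        self.clip_model = clip_model

    def forward(self, pixel_values):
        # A plain tensor comes out, so the tracer never sees a dict or tuple.
        return self.clip_model.get_image_features(pixel_values=pixel_values)

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
wrapped = CLIPImageEncoder(clip).eval()
traced = torch.jit.trace(wrapped, torch.rand(1, 3, 224, 224))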

Hey,
Did you fix the issue?
Because I faced the same problem.
Thank you!

Thank you so much

I found a solution. I went into the source file for CLIPVisionTransformer, located the forward function, and forcefully set the return_dict variable to False.

I know this thread is a bit old at this point, but in case you are like me, trying to do this today and struggling to run these code snippets without error: the solution snippet is very close to working as-is. A super easy way to accomplish the

# wrapped_model -> wrapped CLIPModel so that forward() function returns get_image_features()

line in this code snippet is to hack the forward method so that it always calls the get_image_features method instead, like so:

model_pt.forward = lambda *args, **kwargs: model_pt.get_image_features(*args, **kwargs)

If you add that, this entire code snippet posted in the solution should work properly.
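For anyone landing here later, here’s a rough end-to-end sketch that stitches the pieces of this thread together (untested against any specific library versions; "image.jpg" is a placeholder path, and the output name is read from the model spec instead of being hard-coded):

import coremltools as ct
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_version = "openai/clip-vit-large-patch14"
processor = CLIPProcessor.from_pretrained(model_version)
model_pt = CLIPModel.from_pretrained(model_version).eval()

# Redirect forward() to get_image_features(), as described above.
model_pt.forward = lambda *args, **kwargs: model_pt.get_image_features(*args, **kwargs)

example_input = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    model_traced = torch.jit.trace(model_pt, example_input)

model_coreml = ct.convert(
    model_traced,
    inputs=[ct.TensorType(name="input_image", shape=example_input.shape)],
)

# Sanity check on a real image.
image = Image.open("image.jpg")
pixel_values = processor(images=[image], return_tensors="pt")["pixel_values"]
with torch.no_grad():
    res_pt = model_pt.get_image_features(pixel_values).numpy()
output_name = model_coreml.get_spec().description.output[0].name
res_coreml = model_coreml.predict({"input_image": pixel_values.numpy()})[output_name]
print(np.abs(res_pt - res_coreml).max())  # expect a small but non-zero difference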

In addition, if you also want to see how the text encoder could be converted to CoreML, I found this repo to be super helpful: GitHub - mazzzystar/Queryable: Run OpenAI's CLIP model on iOS to search photos.