Converting CLIP to CoreML

Hi, I’d like to convert the CLIP model to CoreML, but I’m getting some errors. Has anyone managed to do this? It would be great if someone could point me in the right direction.

I thought of first converting the vision model. Here’s my code so far:

import coremltools as ct
import torch
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained(model_version)
visual_model = model.vision_model

# Trace the model with random data.
example_input_image = torch.rand(1, 3, 224, 224)

traced_model = torch.jit.trace(visual_model, example_input_image)
out = traced_model(example_input_image)

There are a couple of issues:

  1. RuntimeError: Encountering a dict at the output of the tracer might cause the trace to be incorrect, this is only valid if the container structure does not change based on the module’s inputs. Consider using a constant container instead (e.g. for list, use a tuple instead. for dict, use a NamedTuple instead). If you absolutely need this and know the side effects, pass strict=False to trace() to allow this behavior.

  2. Warning: site-packages/transformers/models/clip/ TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can’t record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs! if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):

  3. Warning: /site-packages/transformers/models/clip/ TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can’t record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs! if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):

I’ll tag @pcuenq here

Thanks @radames! And hello, @alkibijad :slight_smile:

Regarding the first issue, you might avoid getting a dict in the output by passing return_dict=False to the invocation. I’m not sure this will be enough, as I think the model already returns a tuple by default; see the documentation here.
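A minimal sketch of one way around the dict output, using a toy module (`DictModel`, made up for illustration) in place of the CLIP vision model so that it runs without downloading weights. A thin wrapper picks a single tensor out of the dict, so the tracer only ever sees a tensor at the output. For the real model, the wrapper would presumably call the vision model and return the field you need; which field that is should be checked against the docs.

```python
import torch
from torch import nn


class DictModel(nn.Module):
    """Toy stand-in (hypothetical) for a model whose forward() returns a dict."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 2)

    def forward(self, x):
        return {"pooler_output": self.proj(x)}


class TensorOutput(nn.Module):
    """Wrapper whose forward() returns one tensor instead of a dict."""

    def __init__(self, model, key):
        super().__init__()
        self.model = model
        self.key = key

    def forward(self, x):
        return self.model(x)[self.key]


torch.manual_seed(0)
model = DictModel().eval()
example = torch.rand(1, 4)

# Tracing `model` directly hits the RuntimeError about dict outputs;
# the wrapper avoids it because its output is a single tensor.
traced = torch.jit.trace(TensorOutput(model, "pooler_output"), example)

with torch.no_grad():
    same = torch.allclose(traced(example), model(example)["pooler_output"])
print(same)
```

Passing strict=False to trace(), as the error message suggests, is the other option, but a wrapper keeps the output structure fixed, which tends to be easier for later conversion steps.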

The other warnings are frequent when tracing or compiling models. They are usually fine, as long as you use the model (in production) with inputs of the same shape as the one you used for tracing. The warning means that if you use a different shape, then the code path might be different and the traced code may no longer be correct.
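To make that concrete, here is a toy sketch (the function `scale_or_shift` is made up for illustration, it is not CLIP code). It uses a value-dependent branch rather than a shape check, but the mechanism is the same: the Python `if` is evaluated once during tracing, and only the branch that was taken ends up in the traced graph.

```python
import torch


def scale_or_shift(x):
    # Python-level branch on a tensor value: the tracer cannot record it.
    if x.sum() > 0:
        return x * 2
    return x - 1


positive = torch.ones(3)
negative = torch.full((3,), -3.0)

# trace() emits a TracerWarning here and records only the `x * 2` branch,
# because that is the branch taken for the example input.
traced = torch.jit.trace(scale_or_shift, positive)

eager_out = scale_or_shift(negative)  # takes the `x - 1` branch
traced_out = traced(negative)         # replays the recorded `x * 2` branch
print(eager_out, traced_out)
```

In the CLIP code the flagged `if` statements are sanity checks on tensor sizes, so with inputs of the traced shape the recorded path is the correct one.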

I hope these are enough to get you started, please let us know how it goes!


Thanks for the initial guidance, @pcuenq & @radames! I definitely needed to read the docs a bit more, so I went through both CLIP’s docs and the CoreML docs.

I’ve managed to:

  • Trace the model. The two warnings are still present, but shouldn’t have any impact.
  • Convert the model to CoreML
  • Run inference on the CoreML model

The standard and traced models produce exactly the same output, as they should. But the converted CoreML model produces slightly different output.

The conversion to CoreML yields no errors or warnings:

Converting PyTorch Frontend ==> MIL Ops: 100%|█████████▉| 696/697 [00:00<00:00, 4630.25 ops/s]
Running MIL Common passes: 100%|██████████| 40/40 [00:00<00:00, 96.36 passes/s]
Running MIL Clean up passes: 100%|██████████| 11/11 [00:00<00:00, 56.34 passes/s]
Translating MIL ==> NeuralNetwork Ops: 100%|██████████| 838/838 [00:54<00:00, 15.37 ops/s] 

When I pass the exact same image through the models, the CoreML model gives different results. Here’s how my code looks (simplified a bit):

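import coremltools as ct
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

# Load CLIP
processor = CLIPProcessor.from_pretrained(model_version)
model_pt = CLIPModel.from_pretrained(model_version).eval()

# wrapped_model -> wrapped CLIPModel so that forward() returns get_image_features()
class ImageFeatures(torch.nn.Module):
    def __init__(self, clip):
        super().__init__()
        self.clip = clip

    def forward(self, pixel_values):
        return self.clip.get_image_features(pixel_values=pixel_values)

wrapped_model = ImageFeatures(model_pt).eval()

# Trace
example_input = torch.rand(1, 3, 224, 224)
model_traced = torch.jit.trace(wrapped_model, example_input)

# Convert traced model to CoreML
model_coreml = ct.convert(
    model_traced,
    inputs=[ct.TensorType(name="input_image", shape=example_input.shape)],
)

# Inference through all 3 models. Convert to numpy for easier comparison
image = ... # Load real image from path
pixel_values = processor(text=None, images=[image], return_tensors="pt", padding=True)["pixel_values"]

with torch.no_grad():
    res_pt = model_pt.get_image_features(pixel_values=pixel_values).numpy()
    res_traced = model_traced(pixel_values).numpy()
# 'output_name' stands for the output name assigned during conversion
res_coreml = model_coreml.predict({'input_image': pixel_values.numpy()})['output_name']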

# Compare outputs
print(np.array_equal(res_pt, res_traced)) # True -> standard and traced model produce the same results
print(np.array_equal(res_pt, res_coreml)) # False -> different output

# How close are the outputs?
print(np.allclose(res_pt, res_coreml, atol=1e-5)) # False
print(np.allclose(res_pt, res_coreml, atol=1e-2)) # False
print(np.allclose(res_pt, res_coreml, atol=1e-1)) # True -> results differ by at most ~0.1

Does anyone know what could cause this difference in output? The weights are stored as float32, so there’s no 32 → 16 bit conversion.

Hi @alkibijad! Glad to know that you are making progress :slight_smile:

Regarding your observation about the outputs, you are right: it is normal that the results are not numerically identical to the ones from PyTorch. Even if the weights are stored in 32-bit format, it doesn’t mean that the model has been converted using 32-bit precision, or runs with it.

The default conversion procedure is not guaranteed to happen in float32, as described in this documentation. Furthermore, execution precision depends on the hardware you run your model on. Another factor that adds some confusion is the conversion format (the legacy “Neural Network” format vs. the newer “ML Program”). If you convert to Neural Network, inference precision will be 32-bit when the model (or portions of it) runs on CPU, but 16-bit on GPU and Neural Engine. Using ML Program, you can also run with 32-bit precision on GPU (but not on Neural Engine).

You can force conversion to happen in 32-bit mode. In ML Program mode, you can even convert most of your operations using 16-bit precision but preserve some of them in 32-bit mode.
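As a rough sketch of both options, assuming a recent coremltools and a model that has already been traced: the helper name `convert_image_encoder` and its arguments are made up for illustration, and keeping "softmax" in float32 is only an example, since which ops are worth preserving depends on the model.

```python
def convert_image_encoder(traced_model, input_shape=(1, 3, 224, 224), keep_fp32=True):
    """Convert a traced model to an ML Program with an explicit precision."""
    # Imported here so the helper can be defined without coremltools installed.
    import coremltools as ct

    if keep_fp32:
        # Everything is converted in float32.
        precision = ct.precision.FLOAT32
    else:
        # Mixed precision: ops for which op_selector returns True are cast
        # to float16; the rest (here: softmax, as an example) stay in float32.
        precision = ct.transform.FP16ComputePrecision(
            op_selector=lambda op: op.op_type != "softmax"
        )

    return ct.convert(
        traced_model,
        inputs=[ct.TensorType(name="input_image", shape=input_shape)],
        convert_to="mlprogram",
        compute_precision=precision,
    )
```

With the names from the code earlier in the thread, this would be called as `convert_image_encoder(model_traced)`.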

To verify the correctness of the conversion, you can first run the result on CPU, selecting the appropriate compute units, and measure the error with respect to the outputs from the original model, but without expecting numerical equivalence. This article suggests a signal-to-noise metric to measure the difference. Then you can run predictions incorporating GPU and/or Neural Engine, and measure the quality again. It’s usually OK to use conversion defaults and run in 16-bit mode on all eligible devices, but depending on your project (and the model) you might need to override some settings or resort to more exotic things such as per-op precision specification using typed tensors.
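For reference, one common way to write such a signal-to-noise metric is sketched below. This is a generic definition in decibels, not necessarily the exact formula from the article, and the function name `snr_db` is my own. In your case `reference` would be `res_pt` and `candidate` would be `res_coreml`.

```python
import numpy as np


def snr_db(reference, candidate, eps=1e-12):
    """Signal-to-noise ratio in dB, treating (reference - candidate) as noise."""
    reference = np.asarray(reference, dtype=np.float64).ravel()
    candidate = np.asarray(candidate, dtype=np.float64).ravel()
    noise_power = np.mean((reference - candidate) ** 2) + eps  # eps avoids log(0)
    signal_power = np.mean(reference ** 2)
    return 10.0 * np.log10(signal_power / noise_power)


# Toy data: the noise power is 1% of the signal power, i.e. about 20 dB.
reference = np.array([1.0, 1.0, 1.0, 1.0])
candidate = np.array([1.1, 0.9, 1.1, 0.9])
print(round(snr_db(reference, candidate), 2))
```

Higher is better. Unlike np.allclose with a fixed tolerance, the value is relative to the magnitude of the outputs, so it is easier to compare across models and across compute units.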

Let us know how it goes :slight_smile:


Thank you @pcuenq, it’s nice of you to give such a detailed answer! Lots of useful information in it.

The difference in the output has no impact in my use case, so I can say that the CLIP model can be successfully converted to CoreML :raised_hands:
