Is there a way to include the text_projection and/or embedding normalization in an optimum-optimized CLIPTextModelWithProjection?

transformers includes a modification of CLIPTextModel that includes the text_projection done by the full CLIPModel (CLIPTextModelWithProjection), but as far as I can tell there’s not an easy way to package that model with optimum. Is there a simple way to do this?

What I’ve tried
Using the optimum.exporters.onnx.convert.export function to export an instance of CLIPTextModelWithProjection, I wanted to simply change the output_names that get passed to torch.onnx.export, but the CLIPTextOnnxConfig specifies its outputs to only include the pre-projection outputs (and I can’t just re-assign my own outputs values to the config, unfortunately).

I figured I could make a contribution, and either add a new OnnxConfig subclass or parameterize the outputs property somehow, but both CLIPTextModel and CLIPTextModelWithProjection have the model_type “clip-text-model”, so I’m not sure what the best way would be to parameterize the OnnxConfig to allow for either output (if there’s a natural way to do this, I’d be more than happy to make a PR).

Since the optimum.exporters.onnx.convert.export function only exposes the config as a param, and not the output names explicitly (which makes sense, but this keeps me from overriding the outputs), I figured I could maybe use another function to export, but I haven’t found an appropriate one. Maybe this is enough of an edge case that I should just be using PyTorch’s ONNX export utils directly? I’m hoping there’s something more optimum-supported though, since transformers has the CLIPTextModelWithProjection model, but I understand if not.

Finally, I could just use the “text_embeds” output from the full CLIP model, but I was hoping to limit the model size I’m carrying around since all I need is text in this case. If using the full model is the recommended way around this, I (again) understand, but wanted to ask if there’s some supported way to get an ONNX model that only produces the projected (and ideally also normalized) text embeddings.
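
For reference, the "projected and normalized" embedding asked for here amounts to a bias-free linear projection of the pooled text output followed by L2 normalization. Below is a minimal, dependency-free sketch of those two steps. The function names and toy numbers are illustrative assumptions, not part of transformers or optimum:

```python
import math
from typing import List


def project(pooled: List[float], weight: List[List[float]]) -> List[float]:
    """Bias-free linear projection: one output value per row of `weight`."""
    return [sum(w * x for w, x in zip(row, pooled)) for row in weight]


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale `vec` to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


# Toy values standing in for a pooled text output and a projection matrix.
pooled_output = [1.0, 2.0, 3.0]
projection_weight = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # maps 3 -> 2 dimensions

text_embeds = project(pooled_output, projection_weight)
normalized = l2_normalize(text_embeds)
```

An exported graph that ends at the projection would still need the normalization applied afterwards, for example on the returned array.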

Hi @pakelley, interesting case! As you said, the fact that the model type is the same for both makes it hard to manage.

However, I think you were on the right path with this:

Since the optimum.exporters.onnx.convert.export function only exposes the config as a param, and not the output names explicitly (which makes sense, but this keeps me from overriding the outputs)

Why not override CLIPTextOnnxConfig as follows:

from typing import Dict

from optimum.exporters.onnx.model_configs import CLIPTextOnnxConfig

class CLIPTextModelWithProjectionOnnxConfig(CLIPTextOnnxConfig):
    # Expose the projected embeddings instead of the pre-projection pooled output.
    @property
    def outputs(self) -> Dict[str, Dict[int, str]]:
        return {
            "text_embeds": {0: "batch_size"},
            "last_hidden_state": {0: "batch_size", 1: "sequence_length"},
        }

and giving an instance of this new ONNX config class as an argument to the optimum.exporters.onnx.convert.export function?
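
To illustrate why subclassing works where re-assignment does not, here is a dependency-free sketch of the pattern. The classes below are hypothetical stand-ins, not the optimum classes, and the base output names are assumed for illustration. The sketch assumes the exporter takes its output names from the keys of the config's read-only outputs property:

```python
from typing import Dict, List


class BaseTextOnnxConfig:
    """Hypothetical stand-in for CLIPTextOnnxConfig (not the optimum class)."""

    @property
    def outputs(self) -> Dict[str, Dict[int, str]]:
        # Assumed pre-projection outputs, for illustration only.
        return {
            "last_hidden_state": {0: "batch_size", 1: "sequence_length"},
            "pooler_output": {0: "batch_size"},
        }


class WithProjectionOnnxConfig(BaseTextOnnxConfig):
    """Subclass that swaps in the projected output."""

    @property
    def outputs(self) -> Dict[str, Dict[int, str]]:
        return {
            "text_embeds": {0: "batch_size"},
            "last_hidden_state": {0: "batch_size", 1: "sequence_length"},
        }


def output_names_for_export(config: BaseTextOnnxConfig) -> List[str]:
    """Mimics an exporter that derives output names from the config."""
    return list(config.outputs.keys())


base_names = output_names_for_export(BaseTextOnnxConfig())
projected_names = output_names_for_export(WithProjectionOnnxConfig())

# A property without a setter rejects assignment, hence the subclass.
try:
    BaseTextOnnxConfig().outputs = {}
    assignment_failed = False
except AttributeError:
    assignment_failed = True
```

Because the export function only sees the config object, overriding the property is enough to change which outputs are named in the exported graph.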

Fantastic, this works! Seems sorta obvious in retrospect, but thanks for bearing with me.

Would this be something worth adding to optimum? I’m happy to make a PR if so, but if it just seems like more to maintain no worries.

I also had a sort of tangential question that I haven’t found a great answer to, but I’m happy to start a new thread if that’s better:
What’s the best way to create a Pipeline for a model like this? The issue I’m running into is that the input types don’t end up lining up for some reason (the input_ids are int64, while the model expects int32).

Here’s what I’ve tried (using a model exported with the CLIPTextModelWithProjectionOnnxConfig you suggested above):

from transformers import CLIPProcessor, CLIPTextConfig, pipeline
from optimum.onnxruntime import ORTModelForCustomTasks

# model_path is a pathlib.Path pointing to the exported model directory
processor = CLIPProcessor.from_pretrained(model_path)
model_config = CLIPTextConfig.from_json_file(model_path / "config.json")
model = ORTModelForCustomTasks.from_pretrained(model_path, config=model_config)

onnx_extractor = pipeline("feature-extraction", model=model, tokenizer=processor)
onnx_extractor(["test query"])

Which gives me the following error:

/path/to/my/project/__pypackages__/3.10/lib/transformers/models/clip/ FutureWarning: The class CLIPFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use CLIPImageProcessor instead.
InvalidArgument                           Traceback (most recent call last)
Cell In[51], line 11
      8 model = ORTModelForCustomTasks.from_pretrained(model_path, config=model_config)
     10 onnx_extractor = pipeline("feature-extraction", model=model, tokenizer=processor)
---> 11 onnx_extractor(["test query"])

File /path/to/my/project/__pypackages__/3.10/lib/transformers/pipelines/, in FeatureExtractionPipeline.__call__(self, *args, **kwargs)
     97 def __call__(self, *args, **kwargs):
     98     """
     99     Extract the features of the input(s).
    105         A nested list of `float`: The features computed by the model.
    106     """
--> 107     return super().__call__(*args, **kwargs)

File /path/to/my/project/__pypackages__/3.10/lib/transformers/pipelines/, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
   1096 if can_use_iterator:
   1097     final_iterator = self.get_iterator(
   1098         inputs, num_workers, batch_size, preprocess_params, forward_params, postprocess_params
   1099     )
-> 1100     outputs = list(final_iterator)
   1101     return outputs
   1102 else:

File /path/to/my/project/__pypackages__/3.10/lib/transformers/pipelines/, in PipelineIterator.__next__(self)
    121     return self.loader_batch_item()
    123 # We're out of items within a batch
--> 124 item = next(self.iterator)
    125 processed = self.infer(item, **self.params)
    126 # We now have a batch of "inferred things".

File /path/to/my/project/__pypackages__/3.10/lib/transformers/pipelines/, in PipelineIterator.__next__(self)
    123 # We're out of items within a batch
    124 item = next(self.iterator)
--> 125 processed = self.infer(item, **self.params)
    126 # We now have a batch of "inferred things".
    127 if self.loader_batch_size is not None:
    128     # Try to infer the size of the batch

File /path/to/my/project/__pypackages__/3.10/lib/transformers/pipelines/, in Pipeline.forward(self, model_inputs, **forward_params)
   1023     with inference_context():
   1024         model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device)
-> 1025         model_outputs = self._forward(model_inputs, **forward_params)
   1026         model_outputs = self._ensure_tensor_on_device(model_outputs, device=torch.device("cpu"))
   1027 else:

File /path/to/my/project/__pypackages__/3.10/lib/transformers/pipelines/, in FeatureExtractionPipeline._forward(self, model_inputs)
     84 def _forward(self, model_inputs):
---> 85     model_outputs = self.model(**model_inputs)
     86     return model_outputs

File /path/to/my/project/__pypackages__/3.10/lib/optimum/optimum/, in OptimizedModel.__call__(self, *args, **kwargs)
     84 def __call__(self, *args, **kwargs):
---> 85     return self.forward(*args, **kwargs)

File /path/to/my/project/__pypackages__/3.10/lib/optimum/optimum/onnxruntime/, in ORTModelForCustomTasks.forward(self, **kwargs)
   2140 onnx_inputs = self._prepare_onnx_inputs(use_torch=use_torch, **kwargs)
   2142 # run inference
-> 2143 onnx_outputs =, onnx_inputs)
   2144 outputs = self._prepare_onnx_outputs(onnx_outputs, use_torch=use_torch)
   2146 # converts output to namedtuple for pipelines post-processing

File /path/to/my/project/__pypackages__/3.10/lib/onnxruntime/capi/, in, output_names, input_feed, run_options)
    215     output_names = [ for output in self._outputs_meta]
    216 try:
--> 217     return, input_feed, run_options)
    218 except C.EPFail as err:
    219     if self._enable_fallback:

InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (tensor(int64)) , expected: (tensor(int32))

Is there a simple way to do this conversion in the pipeline? I can make it work by using the processor/model manually, and converting the input_ids myself, but it would be nice to have a single Pipeline object that handles this all for me, if possible.
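
For anyone following along, the manual conversion mentioned above can be sketched as a small helper that casts each input to the dtype the ONNX session declares. This is a sketch under assumptions: the helper name, the type mapping, and the toy token ids are illustrative, and it applies when calling the session directly rather than inside a Pipeline:

```python
from typing import Dict

import numpy as np

# Maps ONNX Runtime type strings to NumPy dtypes (only the ones needed here).
ORT_TO_NUMPY = {"tensor(int32)": np.int32, "tensor(int64)": np.int64}


def cast_to_expected(
    inputs: Dict[str, np.ndarray], expected: Dict[str, str]
) -> Dict[str, np.ndarray]:
    """Cast each named input to the dtype the session declares for it."""
    casted = {}
    for name, value in inputs.items():
        target = ORT_TO_NUMPY.get(expected.get(name, ""))
        # Inputs with no known expected type are passed through unchanged.
        casted[name] = value.astype(target) if target is not None else value
    return casted


# With a real onnxruntime.InferenceSession, `expected_types` could be built as
# {i.name: i.type for i in session.get_inputs()}; hard-coded here for illustration.
expected_types = {"input_ids": "tensor(int32)"}
tokenized = {"input_ids": np.array([[49406, 1332, 49407]], dtype=np.int64)}

onnx_inputs = cast_to_expected(tokenized, expected_types)
```

Reading the expected types from the session instead of hard-coding int32 keeps the helper working if a later export produces int64 inputs.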

What could make sense is to add this ONNX config to optimum/optimum/exporters/onnx/ in the huggingface/optimum repository on GitHub, so that it is available for everybody to use with the optimum.exporters.onnx.convert.export method. But I don’t think we want to make the CLI more complex for this edge case.
WDYT @fxmarty @Jingya @michaelbenayoun @echarlaix ?

Regarding the dtype error, it seems to be linked to issue #994 in huggingface/optimum on GitHub (“ORTStableDiffusionPipeline: INVALID_ARGUMENT : Unexpected input data type. Actual: (tensor(int64)) , expected: (tensor(int32))”).
Which version of torch are you running?

I’m on torch 1.13.1, looks like I have the reverse of the issue there in case it matters. I had some incompatibility issues with 2.0 that I haven’t gotten around to resolving yet.
I’m happy to continue the conversation in that issue if it seems more appropriate to you though.
Thanks for your help!

For what it’s worth, I tried using torch 1.12 and got the same error

Ok, seems to work on torch 2.0! But I can’t create my pipeline haha.
This works:

normalized_text = preprocessor(text=["testing"])

But when I try to create my pipeline (even if I specify the framework):

pipe = pipeline("feature-extraction", model=model, tokenizer=preprocessor, framework="pt")

I get this error:

TypeError                                 Traceback (most recent call last)
Cell In[20], line 3
      1 from transformers import pipeline
----> 3 pipe = pipeline("feature-extraction", model=model, tokenizer=preprocessor, framework="pt")

File /path/to/my/project/.venv/lib/python3.10/site-packages/transformers/pipelines/, in pipeline(task, model, config, tokenizer, feature_extractor, image_processor, framework, revision, use_fast, use_auth_token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)
    784 # Infer the framework from the model
    785 # Forced if framework already defined, inferred if it's None
    786 # Will load the correct model if possible
    787 model_classes = {"tf": targeted_task["tf"], "pt": targeted_task["pt"]}
--> 788 framework, model = infer_framework_load_model(
    789     model,
    790     model_classes=model_classes,
    791     config=config,
    792     framework=framework,
    793     task=task,
    794     **hub_kwargs,
    795     **model_kwargs,
    796 )
    798 model_config = model.config
    799 hub_kwargs["_commit_hash"] = model.config._commit_hash

File /path/to/my/project/.venv/lib/python3.10/site-packages/transformers/pipelines/, in infer_framework_load_model(model, config, model_classes, task, framework, **model_kwargs)
    277     if isinstance(model, str):
    278         raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
--> 280 framework = infer_framework(model.__class__)
    281 return framework, model

File /path/to/my/project/.venv/lib/python3.10/site-packages/transformers/utils/, in infer_framework(model_class)
    581         return "flax"
    582 else:
--> 583     raise TypeError(f"Could not infer framework from class {model_class}.")

TypeError: Could not infer framework from class <class 'optimum.onnxruntime.modeling_ort.ORTModelForCustomTasks'>.

Notably, I was able to create one before.

That’s weird.

Did you upgrade Transformers at some point between your tries? It seems the way the framework is inferred has changed in v4.30. Could you try with v4.29 please?