Export M2M100 model to ONNX

I’ve ported facebook/m2m100_418M to ONNX for the translation task using this, but when I visualize it with Netron, it requires 4 inputs: input_ids, attention_mask, decoder_input_ids and decoder_attention_mask, and I don’t know how to run inference with ONNX Runtime.

How can I solve this problem?
Thanks in advance for your help.

Did you find a solution?

I have the same issue. Have you found a solution yet?

I tried to convert this model to ONNX by specifying the task like this: python3.8 -m transformers.onnx --model=facebook/m2m100_418M onnx/ --feature=seq2seq-lm-with-past. In that case it says it needs 54 inputs; otherwise I have the same problem. I know that the model needs the input and output languages, but I can’t really understand how to use the model with ONNX. An example would be welcome :wink:

I also looked for indications in the commit of the model: M2M100 support for ONNX export by michaelbenayoun · Pull Request #15193 · huggingface/transformers · GitHub. I think it can be useful.

cc @lewtun

I also have the same question. Could I have an example for this M2M100 ONNX model? It would be very helpful.

Hi folks, the best way to run inference with ONNX models is via the optimum library. This library allows you to inject ONNX models directly in the pipeline() function from transformers and thus skip all the annoying pre- and post-processing steps :slight_smile:

Here’s a demo for M2M100 based on the docs:

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/m2m100_418M")
# `from_transformers` will export the model to ONNX on-the-fly 🤯
model = ORTModelForSeq2SeqLM.from_pretrained("facebook/m2m100_418M", from_transformers=True)
onnx_translation = pipeline("translation_en_to_de", model=model, tokenizer=tokenizer)

text = "My name is Lewis."
# returns [{'translation_text': 'Mein Name ist Lewis.'}]
pred = onnx_translation(text)

Hope that helps!


I’m running into the following error when I run the code from @lewtun as is:

AttributeError: type object 'FeaturesManager' has no attribute 'determine_framework'

I’m using the following versions:
torch → ‘1.10.0’
transformers → ‘4.20.1’

cc @fxmarty who might be able to take a look :pray:

Thanks. Also, I’m not sure where the target language ‘de’ mentioned above is set in the tokenizer/model. I’d greatly appreciate your help.
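On where the target language goes: a minimal sketch, assuming the M2M100 tokenizer API described in the transformers docs (the src_lang attribute and get_lang_id()) and assuming the ONNX model exposes generate() like its PyTorch counterpart. The helper name translate is hypothetical and not part of either library, and other multilingual models may use different language codes or tokenizer methods.

```python
def translate(model, tokenizer, text, src_lang, tgt_lang, **generate_kwargs):
    """Translate `text` with an M2M100-style model and tokenizer.

    `model` is assumed to expose `generate()`, and `tokenizer` is assumed to
    behave like the M2M100 tokenizer (`src_lang`, `get_lang_id`).
    """
    # Tell the tokenizer which language the input is written in, so that it
    # adds the matching source-language token.
    tokenizer.src_lang = src_lang
    encoded = tokenizer(text, return_tensors="pt")
    # Force the first generated token to be the target-language token.
    generated = model.generate(
        **encoded,
        forced_bos_token_id=tokenizer.get_lang_id(tgt_lang),
        **generate_kwargs,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

With the model and tokenizer objects from the snippet above, this would be called as translate(model, tokenizer, "My name is Lewis.", "en", "de").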

Hi @awaiskaleem, transformers==4.20.1 is 5 months old. Could you try updating (pip install --upgrade transformers)? The current supported stable version is 4.25.1. The code snippet from @lewtun works well for me with transformers==4.25.1 and optimum==1.5.1.

Additionally, @NNDam, @double, @omoekan, @Jour, @echoRG, @awaiskaleem, I wanted to let you know that the ONNX export through transformers.onnx will likely soon rely on optimum.exporters as a soft dependency, which is where everything export-related will be maintained. You can check the documentation here.

Now, specifically for M2M100, keep in mind that it is a seq2seq (translation) model. It therefore uses both an encoder and a decoder, as detailed in the transformers docs. In transformers, the standard usage is model.generate(**inputs). However, by default the ONNX export cannot capture the generation loop that repeatedly calls the decoder: transformers/utils.py at d51e7c7e8265d69db506828dce77eb4ef9b72157 · huggingface/transformers · GitHub . Hence, when exporting to ONNX as a single file, the model is hard to use unless you do some manual surgery on the ONNX graph.

The solution that is currently explored and in use in Optimum’s ORTModelForSeq2SeqLM, which relies on ONNX Runtime, is to use two ONNX files: one for the encoder and one for the decoder.

Using Optimum main (not yet in the stable release, but you can expect it next week), running python -m optimum.exporters.onnx --model valhalla/m2m100_tiny_random --for-ort m2m100_tiny_onnx_ort gives two models:

  • an encoder expecting input_ids and attention_mask
  • a decoder expecting encoder_attention_mask, input_ids and encoder_hidden_states. This closely follows the transformers decoder and generate().

So if you would like to use these exported ONNX models outside of Optimum, I recommend exporting them with the above command and then handling the models yourself. But ORTModelForSeq2SeqLM is meant to save you that hassle.
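To make "handling the models yourself" more concrete, here is a minimal sketch of a greedy decoding loop over separate encoder and decoder models. It is not Optimum's implementation: run_encoder and run_decoder are hypothetical callables standing in for the two exported ONNX files, and the toy functions below only imitate their shapes (a single sentence, no batch dimension) so that the loop can be run without any model.

```python
def greedy_decode(run_encoder, run_decoder, input_ids, attention_mask,
                  start_ids, eos_id, max_new_tokens=32):
    """Greedy decoding for one sentence with separate encoder/decoder callables."""
    # The encoder runs exactly once per input sentence.
    encoder_hidden_states = run_encoder(input_ids, attention_mask)
    generated = list(start_ids)
    for _ in range(max_new_tokens):
        # The decoder is re-run on the whole prefix at every step (no KV cache).
        logits = run_decoder(generated, encoder_hidden_states, attention_mask)
        last = logits[-1]  # scores over the vocabulary for the next token
        next_id = max(range(len(last)), key=lambda i: last[i])
        generated.append(next_id)
        if next_id == eos_id:
            break
    return generated


# Toy stand-ins (not real models): the "decoder" copies the source, then emits EOS.
VOCAB_SIZE, EOS = 5, 2
encoder_calls = []

def toy_encoder(input_ids, attention_mask):
    encoder_calls.append(1)
    return list(input_ids)

def toy_decoder(decoder_ids, encoder_hidden_states, encoder_attention_mask):
    rows = []
    for position in range(len(decoder_ids)):
        if position < len(encoder_hidden_states):
            target = encoder_hidden_states[position]
        else:
            target = EOS
        rows.append([1.0 if tok == target else 0.0 for tok in range(VOCAB_SIZE)])
    return rows

result = greedy_decode(toy_encoder, toy_decoder, [3, 4, 1], [1, 1, 1],
                       start_ids=[0], eos_id=EOS)
print(result)  # prints [0, 3, 4, 1, 2]
```

With ONNX Runtime, the two callables would typically wrap onnxruntime.InferenceSession objects and call session.run(None, {...}) with int64 NumPy arrays that include a batch dimension, using the input names listed above. For M2M100, start_ids would presumably hold the decoder start token followed by the target-language token ID; that is an assumption to check against the model config.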

If you want to try it right away, feel free to try the dev version: pip install -U git+https://github.com/huggingface/optimum.git@main

Edit 2022-12-27: Feel free to have a look at the latest release notes which includes the feature: Release v1.6.0: Optimum CLI, Stable Diffusion ONNX export, BetterTransformer & ONNX support for more architectures · huggingface/optimum · GitHub

This partially worked for me: I was able to load the model and run inference successfully, but the text wasn’t translated into “de” as in your example. The result I got was the following:

  pred: [{'translation_text': 'de: My name is Lewis.'}]

The model I’m using is: “facebook/nllb-200-distilled-600M”

Hi there! Thanks for the insights, I have a couple of questions about the architecture of the ORT seq2seq model that I hope you could clarify.

Firstly, I’m curious why Optimum requires the encoder and decoder to be loaded from two separate ONNX files, instead of a single ONNX file? I’m guessing (from a quick glance at the source code) that it’s because it utilizes two ORT inference sessions for the encoder and decoder instead of using a single session for the entire model – is there a rationale for this design?

Secondly, you mentioned that the ONNX export of the model faces difficulties with the generate loop in the decoder, while the transformers model seems to handle it fine. I’m wondering what specifically in the generate loop makes it challenging for the ORT model to handle?

@ahmedbr Could you file a bug report with a reproduction script on Issues · huggingface/optimum · GitHub so that I can have a look at it?

@luckyt The main reason is that you normally want to run the encoder only once, while you loop over the decoder when generating. You could say: OK, why not wrap everything into a single ONNX file, with, say, an If node that decides whether or not to run the encoder, and the encoder and decoder as subgraphs?

This could actually be doable. The issue is that usability gets a bit harder, as the encoder and decoder do not have the same inputs/outputs. So you would need to create fake inputs/outputs, which works in theory but may lead to errors and be a bit unintuitive.

About generation, what is slightly challenging is that inputs/outputs are fixed with ONNX and, more importantly, that the export uses torch.jit.trace, which cannot handle control flow. Control flow is typically used to handle the without-past/with-past cases (whether or not the KV cache is used): in the first generation step you don’t use the KV cache, while in later steps you do. See transformers/src/transformers/models/t5/modeling_t5.py at v4.30.2 · huggingface/transformers · GitHub & How does the ONNX exporter work for GenerationModel with `past_key_value`?
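One way to picture the without-past/with-past split is to keep the branch in Python and call a different decoder depending on whether a cache exists yet. The sketch below assumes two separately exported decoders, represented by the hypothetical callables run_decoder (first step, no cache) and run_decoder_with_past (later steps). The toy functions only imitate them, with the "cache" reduced to a position counter.

```python
def greedy_decode_with_cache(run_encoder, run_decoder, run_decoder_with_past,
                             input_ids, attention_mask, start_ids, eos_id,
                             max_new_tokens=32):
    """Greedy decoding where the cache branch lives in Python, not in the graph."""
    encoder_hidden_states = run_encoder(input_ids, attention_mask)
    generated = list(start_ids)
    past = None
    for _ in range(max_new_tokens):
        if past is None:
            # First step: no cache yet, so the full prefix is fed.
            last_logits, past = run_decoder(
                generated, encoder_hidden_states, attention_mask)
        else:
            # Later steps: only the newest token is fed, plus the cache.
            last_logits, past = run_decoder_with_past(
                generated[-1:], encoder_hidden_states, attention_mask, past)
        next_id = max(range(len(last_logits)), key=lambda i: last_logits[i])
        generated.append(next_id)
        if next_id == eos_id:
            break
    return generated


# Toy stand-ins (not real models): copy the source tokens, then emit EOS.
VOCAB_SIZE, EOS = 5, 2
with_past_input_lengths = []

def _one_hot_for(position, source):
    target = source[position] if position < len(source) else EOS
    return [1.0 if tok == target else 0.0 for tok in range(VOCAB_SIZE)]

def toy_encoder(input_ids, attention_mask):
    return list(input_ids)

def toy_decoder(decoder_ids, hidden, mask):
    consumed = len(decoder_ids)  # the toy "cache" is just a token count
    return _one_hot_for(consumed - 1, hidden), consumed

def toy_decoder_with_past(last_ids, hidden, mask, past):
    with_past_input_lengths.append(len(last_ids))
    consumed = past + len(last_ids)
    return _one_hot_for(consumed - 1, hidden), consumed

result = greedy_decode_with_cache(
    toy_encoder, toy_decoder, toy_decoder_with_past,
    [3, 4, 1], [1, 1, 1], start_ids=[0], eos_id=EOS)
print(result)  # prints [0, 3, 4, 1, 2]
```

Because the if/else sits outside the exported graphs, each graph has a fixed set of inputs and outputs and contains no branch that tracing would have to capture.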
