installed packages:
python=3.10.0
onnxruntime-gpu=1.17.0
pytorch=2.2.2
pytorch-cuda=11.8
pytorch-lightning=1.9.3
transformers=4.41.2
I have trained a SegFormer model from the pretrained weights, with a custom classifier head on top:
import torch
from torch import nn
from pytorch_lightning import LightningModule
from transformers import SegformerForSemanticSegmentation

class SegformerFinetuner(LightningModule):
    def __init__(self, model_name="nvidia/segformer-b0-finetuned-ade-512-512", learning_rate=1e-4):
        super().__init__()
        self.model = SegformerForSemanticSegmentation.from_pretrained(model_name)  # input channels: 3; output channels: 150
        self.classifier = nn.Sequential(
            nn.Conv2d(150, 64, kernel_size=3, padding=1),
            nn.Dropout(0.1),
            nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=1),
        )
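For completeness, the forward pass (omitted above) simply feeds the backbone logits through the classifier head; a minimal sketch, assuming the head runs directly on the backbone's quarter-resolution logits:

    def forward(self, pixel_values):
        # SegFormer returns logits at 1/4 of the input resolution: (B, 150, H/4, W/4)
        logits = self.model(pixel_values).logits
        return self.classifier(logits)  # (B, 1, H/4, W/4)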
After training, I exported it to ONNX with:
inference_model = SegformerFinetuner()
inference_model.load_state_dict(torch.load(".\\pt_model\\trained_model.pt"))
inference_model.cuda()
inference_model.eval()

dummy_input = torch.randn(1, 3, 384, 384, requires_grad=True).cuda()
dynamic_axes = {
    "input": {0: "batch", 2: "height", 3: "width"},
    "output": {0: "batch", 2: "height", 3: "width"},
}

# export the model to ONNX
torch.onnx.export(inference_model, dummy_input, f"model_{model_id}.onnx",
                  input_names=['input'], output_names=['output'],
                  do_constant_folding=True,
                  opset_version=16, dynamic_axes=dynamic_axes, verbose=False,
                  export_params=True)
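As a sanity check that the export itself is sound (rather than the runtime being the problem), I compare the two outputs on the same dummy input; a minimal sketch:

import numpy as np
import onnxruntime as onnxrt

# run the PyTorch model and the exported graph on the same input
with torch.no_grad():
    torch_out = inference_model(dummy_input).cpu().numpy()
check_session = onnxrt.InferenceSession(f"model_{model_id}.onnx",
                                        providers=['CUDAExecutionProvider'])
onnx_out = check_session.run(None, {"input": dummy_input.detach().cpu().numpy()})[0]
np.testing.assert_allclose(torch_out, onnx_out, rtol=1e-3, atol=1e-4)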
# start inference
import time
import onnxruntime as onnxrt
from onnxruntime import SessionOptions, GraphOptimizationLevel

options = SessionOptions()
options.graph_optimization_level = GraphOptimizationLevel.ORT_ENABLE_ALL
onnx_session = onnxrt.InferenceSession(f"model_{model_id}.onnx", options,
                                       providers=['CUDAExecutionProvider'])
onnx_inputs = {onnx_session.get_inputs()[0].name: img_torch.unsqueeze(0).cpu().numpy()}

# perform inference and measure the time taken
t_0 = time.perf_counter()
onnx_output = onnx_session.run(None, onnx_inputs)[0]
t_1 = time.perf_counter()
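To make sure the session is really running on CUDA and has not silently fallen back to CPU (onnxruntime-gpu does fall back when the CUDA/cuDNN versions don't match), I check the active providers:

# confirm CUDA is actually in use; a CPU-only list would explain the slowdown
print(onnx_session.get_providers())
# e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider']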
Inference time output (second run, input torch.Size([155, 3, 384, 384]), RTX 4090):
PyTorch on GPU: 0.02 seconds
ONNX Runtime on GPU: 5.06 seconds
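Since the feed is a NumPy array, every run pays a host-to-device copy for the input and a device-to-host copy for the output. In case those copies dominate, here is a sketch of keeping the input on the GPU with onnxruntime's IOBinding (img_torch and the 'input'/'output' names are from the code above):

import numpy as np

# bind an input tensor that already lives on the GPU, so ORT reads it in place
img_gpu = img_torch.unsqueeze(0).contiguous().cuda()
io_binding = onnx_session.io_binding()
io_binding.bind_input(name='input', device_type='cuda', device_id=0,
                      element_type=np.float32, shape=tuple(img_gpu.shape),
                      buffer_ptr=img_gpu.data_ptr())
io_binding.bind_output('output')  # let ORT allocate the output, copied back below
onnx_session.run_with_iobinding(io_binding)
onnx_output = io_binding.copy_outputs_to_cpu()[0]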
I also tried the 🤗 Transformers ONNX export path, thinking it might reduce the inference time, but couldn't get past a tokenizer issue:
from transformers import AutoTokenizer, AutoModelForSemanticSegmentation

# load tokenizer and PyTorch weights from the Hub
pt_model = AutoModelForSemanticSegmentation.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")
tokenizer = AutoTokenizer.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")

local_pt_checkpoint = ".\\segformer_trained_checkpoint"
# save
pt_model.save_pretrained(local_pt_checkpoint)

# export to ONNX
!python -m transformers.onnx --model={local_pt_checkpoint} onnx/
error:

KeyError                                  Traceback (most recent call last)
Cell In[38], line 6
---> 6 tokenizer = AutoTokenizer.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")

KeyError: <class 'transformers.models.segformer.configuration_segformer.SegformerConfig'>
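My guess is that the KeyError comes from asking for a tokenizer at all: SegFormer is a vision model, so no tokenizer is mapped to SegformerConfig. A sketch of what I believe the intended calls are (the --feature value is my assumption from the transformers.onnx docs):

from transformers import AutoImageProcessor

# images are prepared by an image processor, not a tokenizer
image_processor = AutoImageProcessor.from_pretrained("nvidia/segformer-b0-finetuned-ade-512-512")

# the CLI needs the actual path plus the task/feature spelled out
!python -m transformers.onnx --model=.\segformer_trained_checkpoint --feature=semantic-segmentation onnx/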
Is there any way to reduce the ONNX inference time, or is there a flag I'm missing in the torch.onnx.export call? Any help would be appreciated. Thank you.