ORTModelForSpeechSeq2Seq fails to load openai/whisper-large-v3 with CUDAExecutionProvider

My environment:

  • optimum: 1.23.3
  • transformers: 4.46.3
  • python: 3.10.14
  • onnx: 1.17.0
  • onnxruntime-gpu: 1.20.1

And this is my code:

import torch
from transformers import AutoProcessor, pipeline
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

import speech_to_text_pb2_grpc

gpu = True

class SpeechToTextServicer(speech_to_text_pb2_grpc.SpeechToTextServicer):
    def __init__(self):
        device = 'cuda' if gpu and torch.cuda.is_available() else 'cpu'
        # torch_dtype = torch.float16 if gpu and torch.cuda.is_available() else torch.float32
        model_path = './whisper-large-v3'
        # Export the local PyTorch checkpoint to ONNX and create the ORT sessions
        model = ORTModelForSpeechSeq2Seq.from_pretrained(
            model_path,
            export=True,
            provider='CUDAExecutionProvider' if gpu else 'CPUExecutionProvider',
        )
        # model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path)
        self.processor = AutoProcessor.from_pretrained(model_path)

        self.pipe = pipeline(
            "automatic-speech-recognition",
            model=model,
            tokenizer=self.processor.tokenizer,
            feature_extractor=self.processor.feature_extractor,
            # torch_dtype=torch_dtype,
            device=0,  # primary GPU
        )

I get this error when gpu is True. The model is loaded from local files; I had cloned the model repository in advance.
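
For reference, the local ./whisper-large-v3 directory is just a clone of the Hub repository. A sketch of an equivalent programmatic download with huggingface_hub (repo id assumed to be the standard openai/whisper-large-v3):

from huggingface_hub import snapshot_download

# Fetch the full model repo into a local directory (same layout as a git clone)
snapshot_download(repo_id='openai/whisper-large-v3', local_dir='./whisper-large-v3')

The full console output: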

Moving the following attributes in the config to the generation config: {'max_length': 448, 'begin_suppress_tokens': [220, 50257]}. You are seeing this warning because you've set generation parameters in the model config, as opposed to in the generation config.
/home/zsq/miniconda3/envs/stt/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py:1017: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if input_features.shape[-1] != expected_seq_length:
/home/zsq/miniconda3/envs/stt/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py:334: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if attn_output.size() != (bsz, self.num_heads, tgt_len, self.head_dim):
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
/home/zsq/miniconda3/envs/stt/lib/python3.10/site-packages/transformers/models/whisper/modeling_whisper.py:1477: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if sequence_length != 1:
/home/zsq/miniconda3/envs/stt/lib/python3.10/site-packages/transformers/cache_utils.py:458: TracerWarning: Using len to get tensor shape might cause the trace to be incorrect. Recommended usage would be tensor.shape[0]. Passing a tensor of different shape might lead to errors or silently give incorrect results.
  or len(self.key_cache[layer_idx]) == 0  # the layer has no cache
/home/zsq/miniconda3/envs/stt/lib/python3.10/site-packages/transformers/cache_utils.py:443: TracerWarning: Using len to get tensor shape might cause the trace to be incorrect. Recommended usage would be tensor.shape[0]. Passing a tensor of different shape might lead to errors or silently give incorrect results.
  elif len(self.key_cache[layer_idx]) == 0:  # fills previously skipped layers; checking for tensor causes errors
2025-01-15 14:03:13.094558764 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-01-15 14:03:13.094769152 [W:onnxruntime:, session_state.cc:1170 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2025-01-15 14:03:15.207394243 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-01-15 14:03:15.207409973 [W:onnxruntime:, session_state.cc:1170 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2025-01-15 14:03:18.220428029 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-01-15 14:03:18.220442709 [W:onnxruntime:, session_state.cc:1170 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
2025-01-15 14:03:18.442658923 [E:onnxruntime:, inference_session.cc:2117 operator()] Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:376 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 6553600

Traceback (most recent call last):
  File "/home/zsq/projects/AlgoServer/langchain_rag/stt/server.py", line 66, in serve
    speech_to_text_pb2_grpc.add_SpeechToTextServicer_to_server(SpeechToTextServicer(), server)
  File "/home/zsq/projects/AlgoServer/langchain_rag/stt/server.py", line 23, in __init__
    model = ORTModelForSpeechSeq2Seq.from_pretrained(
  File "/home/zsq/miniconda3/envs/stt/lib/python3.10/site-packages/optimum/onnxruntime/modeling_ort.py", line 737, in from_pretrained
    return super().from_pretrained(
  File "/home/zsq/miniconda3/envs/stt/lib/python3.10/site-packages/optimum/modeling_base.py", line 438, in from_pretrained
    return from_pretrained_method(
  File "/home/zsq/miniconda3/envs/stt/lib/python3.10/site-packages/optimum/onnxruntime/modeling_seq2seq.py", line 1090, in _from_transformers
    return cls._from_pretrained(
  File "/home/zsq/miniconda3/envs/stt/lib/python3.10/site-packages/optimum/onnxruntime/modeling_seq2seq.py", line 1377, in _from_pretrained
    return _ORTModelForWhisper._from_pretrained(model_id, config, **kwargs)
  File "/home/zsq/miniconda3/envs/stt/lib/python3.10/site-packages/optimum/onnxruntime/modeling_seq2seq.py", line 1400, in _from_pretrained
    return super(ORTModelForSpeechSeq2Seq, cls)._from_pretrained(model_id, config, **kwargs)
  File "/home/zsq/miniconda3/envs/stt/lib/python3.10/site-packages/optimum/onnxruntime/modeling_seq2seq.py", line 989, in _from_pretrained
    ort_inference_sessions = cls.load_model(
  File "/home/zsq/miniconda3/envs/stt/lib/python3.10/site-packages/optimum/onnxruntime/modeling_seq2seq.py", line 761, in load_model
    decoder_with_past_session = ORTModel.load_model(
  File "/home/zsq/miniconda3/envs/stt/lib/python3.10/site-packages/optimum/onnxruntime/modeling_ort.py", line 399, in load_model
    return ort.InferenceSession(
  File "/home/zsq/miniconda3/envs/stt/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 465, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/zsq/miniconda3/envs/stt/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 537, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:376 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool, onnxruntime::Stream*, bool, onnxruntime::WaitNotificationFn) Failed to allocate memory for requested buffer of size 6553600
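
If I read the traceback correctly, the export itself succeeds (the TracerWarnings above come from tracing the PyTorch model) and the first two ONNX Runtime sessions initialize; it is the third session, the decoder-with-past loaded in modeling_seq2seq.py, that fails. The BFC arena cannot allocate even a ~6 MB buffer on the CUDAExecutionProvider, which suggests the GPU is effectively out of memory at that point. A minimal check of free GPU memory before loading (a sketch using torch's CUDA API):

import torch

# (free, total) in bytes for the current CUDA device; whisper-large-v3 in
# fp32 needs roughly 6 GB for the weights alone, and three ORT sessions
# (encoder, decoder, decoder-with-past) are created on the same device.
free, total = torch.cuda.mem_get_info()
print(f'free: {free / 1024**3:.2f} GiB of {total / 1024**3:.2f} GiB')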

How can I fix this?

Would the following workaround, falling back to the CPU execution provider, be a reasonable fix?

gpu = False

class SpeechToTextServicer(speech_to_text_pb2_grpc.SpeechToTextServicer):
    def __init__(self):
        device = 'cuda' if gpu and torch.cuda.is_available() else 'cpu'
        torch_dtype = torch.float16 if gpu and torch.cuda.is_available() else torch.float32
        model_path = './whisper-large-v3'
        model = ORTModelForSpeechSeq2Seq.from_pretrained(
            model_path,
            export=True,
            provider='CUDAExecutionProvider' if gpu else 'CPUExecutionProvider',
        )
        # model = AutoModelForSpeechSeq2Seq.from_pretrained(model_path)
        self.processor = AutoProcessor.from_pretrained(model_path)

        self.pipe = pipeline(
            "automatic-speech-recognition",
            model=model,
            tokenizer=self.processor.tokenizer,
            feature_extractor=self.processor.feature_extractor,
            torch_dtype=torch_dtype,
            device=device,  # 'cpu' here since gpu is False; device=0 would select the primary GPU
        )
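
Running on the CPU provider would sidestep the allocation failure, but it gives up GPU inference entirely. An alternative worth trying first (a sketch, not verified on this setup; output path is a placeholder) is to export the model to ONNX once, offline, and then load the pre-exported files with the CUDA provider, so the PyTorch tracing step and session creation do not happen in the same process:

from optimum.exporters.onnx import main_export
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

# One-time offline export of the local checkpoint (runs on CPU,
# task is auto-detected from the model config).
main_export(
    model_name_or_path='./whisper-large-v3',
    output='./whisper-large-v3-onnx',
)

# In the server process, load the already-exported model directly on the GPU.
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    './whisper-large-v3-onnx',
    provider='CUDAExecutionProvider',
)

If the GPU is simply occupied by another process, freeing it first, or pointing ONNX Runtime at a different device via provider_options={'device_id': 1}, may be enough on its own.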