Using a fine-tuned Whisper model with sherpa-onnx to create an Android app with Flutter

Hi everyone,

I’m currently working on implementing a fine-tuned Whisper Small model wrapped with Sherpa ONNX in an Android app using Flutter. The goal is to achieve real-time speech-to-text functionality. I’ve been following the documentation provided here: Sherpa ONNX Flutter Examples.

My fine-tuned model is designed for the German language and includes a specific vocabulary. However, I’ve encountered an issue: the documentation mentions the need for a joiner file, but my research indicates that the Whisper model already includes its joiner within the tensor architecture.

I’m looking for any workarounds or additional documentation that could help me integrate the Sherpa ONNX Whisper model into my app without the need for a separate joiner file.

Thank you in advance for your help!

Best,
Jules

1 Like

Hi Jules,

It’s fantastic that you’re working on integrating a fine-tuned Whisper model with Sherpa ONNX in a Flutter-based Android app! Achieving real-time speech-to-text in a specific language is a challenging yet rewarding task.

In my opinion, there are several ways to approach this problem. The following two points should help:

  1. Sherpa ONNX Expectations:
    Sherpa ONNX often assumes models follow a certain structure, especially for RNN-T-based architectures, which include explicit joiner components. However, Whisper’s architecture (transformer-based) doesn’t require an external joiner since it incorporates all relevant processing within its tensor architecture.

  2. Workaround Ideas:

    • Customize Sherpa ONNX:
      You may need to bypass or adapt the Sherpa ONNX codebase to account for Whisper’s unique architecture. For instance, investigate how Sherpa ONNX expects the joiner to be used and modify those parts to align with Whisper’s outputs.
    • Simplify Output Matching:
      If the joiner logic is primarily for aligning model outputs with a vocabulary, you might manually map Whisper’s decoded outputs to your specific vocabulary.
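The second idea, mapping decoded output onto a closed vocabulary, could be sketched as a plain post-processing step. This is only an illustration, not part of the sherpa-onnx API; the edit-distance matching is a deliberately crude similarity measure, and the vocabulary set is hypothetical:

```dart
/// Snap each decoded word onto the closest entry of a closed vocabulary,
/// using Levenshtein edit distance as a crude similarity measure.
String mapToVocabulary(String decoded, Set<String> vocabulary) {
  int editDistance(String a, String b) {
    // Classic dynamic-programming edit distance.
    final dp = List.generate(
        a.length + 1, (i) => List<int>.filled(b.length + 1, 0));
    for (var i = 0; i <= a.length; i++) dp[i][0] = i;
    for (var j = 0; j <= b.length; j++) dp[0][j] = j;
    for (var i = 1; i <= a.length; i++) {
      for (var j = 1; j <= b.length; j++) {
        final cost = a[i - 1] == b[j - 1] ? 0 : 1;
        dp[i][j] = [
          dp[i - 1][j] + 1,        // deletion
          dp[i][j - 1] + 1,        // insertion
          dp[i - 1][j - 1] + cost, // substitution
        ].reduce((x, y) => x < y ? x : y);
      }
    }
    return dp[a.length][b.length];
  }

  return decoded.split(' ').map((word) {
    if (vocabulary.contains(word)) return word;
    // Otherwise pick the vocabulary entry with the smallest edit distance.
    return vocabulary.reduce(
        (a, b) => editDistance(word, a) <= editDistance(word, b) ? a : b);
  }).join(' ');
}
```

For a real app you would likely want a phonetic or subword-aware similarity measure rather than raw edit distance, but the shape of the workaround is the same.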

Feel free to share more details about the integration process, and I’d be happy to brainstorm further! Best of luck with your app development — it sounds like an amazing project!

2 Likes

I am one of the authors of sherpa-onnx.

Please use

It is a pure Dart example, but it contains everything you need to build a Flutter app on top of it.
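For reference, a Whisper model in sherpa-onnx is configured without any joiner file: only the encoder, decoder, and tokens file are passed. A minimal sketch in Dart, with placeholder file paths; the class and field names follow the sherpa-onnx Dart examples, but verify them against the `sherpa_onnx` package version you install:

```dart
import 'dart:typed_data';
import 'package:sherpa_onnx/sherpa_onnx.dart' as sherpa_onnx;

/// Decode one utterance with a (fine-tuned) Whisper model.
/// Note: Whisper needs only encoder + decoder + tokens -- no joiner file.
String decodeWithWhisper(Float32List samples) {
  final config = sherpa_onnx.OfflineRecognizerConfig(
    model: sherpa_onnx.OfflineModelConfig(
      whisper: sherpa_onnx.OfflineWhisperModelConfig(
        encoder: 'whisper-encoder.int8.onnx', // placeholder paths
        decoder: 'whisper-decoder.int8.onnx',
        language: 'de', // fine-tuned for German
      ),
      tokens: 'tokens.txt',
      numThreads: 2,
    ),
  );

  final recognizer = sherpa_onnx.OfflineRecognizer(config);
  final stream = recognizer.createStream();
  stream.acceptWaveform(samples: samples, sampleRate: 16000);
  recognizer.decode(stream);
  final text = recognizer.getResult(stream).text;

  stream.free();
  recognizer.free();
  return text;
}
```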

2 Likes

Hi there,

Thank you for your input! I really appreciate it. The documentation provided by csukuangfj is already excellent. However, I was wondering if you also have a solution for streaming?

Best regards,

Jules

1 Like

We support two-pass ASR, where in the first pass we use a small, fast, but less accurate streaming model and in the second pass we use a non-streaming but more accurate model.

You can find pre-built two-pass ASR APKs with Whisper inside at
https://k2-fsa.github.io/sherpa/onnx/android/apk-2pass.html

(Search for whisper or moonshine in the above link)

small_zipformer_moonshine_tiny_int8.apk

You can also implement two-pass ASR in your Flutter app. The Dart API from sherpa-onnx has everything you need to implement that.
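A rough outline of the two-pass flow in Dart. The class and method names (`OnlineRecognizer`, `isEndpoint`, `acceptWaveform`, and so on) follow the sherpa-onnx Dart examples, but treat them as assumptions and check them against the installed package:

```dart
import 'dart:typed_data';
import 'package:sherpa_onnx/sherpa_onnx.dart' as sherpa_onnx;

/// Two-pass ASR sketch:
///   pass 1: a small streaming model yields low-latency partial results;
///   pass 2: on an endpoint, the buffered audio is re-decoded by a more
///           accurate non-streaming model (e.g. Whisper).
class TwoPassRecognizer {
  TwoPassRecognizer(this.online, this.offline);

  final sherpa_onnx.OnlineRecognizer online;   // streaming (pass 1)
  final sherpa_onnx.OfflineRecognizer offline; // non-streaming (pass 2)
  final List<double> _buffer = [];

  /// Feed a chunk of audio; returns a partial result, or the final
  /// second-pass result when an endpoint is detected.
  String feed(Float32List samples, sherpa_onnx.OnlineStream stream) {
    _buffer.addAll(samples);
    stream.acceptWaveform(samples: samples, sampleRate: 16000);
    while (online.isReady(stream)) {
      online.decode(stream);
    }

    if (online.isEndpoint(stream)) {
      // Endpoint reached: run the accurate second pass on the buffered audio.
      final offlineStream = offline.createStream();
      offlineStream.acceptWaveform(
          samples: Float32List.fromList(_buffer), sampleRate: 16000);
      offline.decode(offlineStream);
      final text = offline.getResult(offlineStream).text;
      offlineStream.free();
      _buffer.clear();
      online.reset(stream);
      return text; // final result from pass 2
    }

    return online.getResult(stream).text; // partial result from pass 1
  }
}
```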

1 Like


The problem we’re currently experiencing is not related to pre-built models. We had an AI engineer create a custom streaming multi-language transcription model that supports German and English.

The problem is this: I am currently using the sherpa_onnx package in the Flutter app, which is the only reasonable pub.dev package that supports Kaldi, Whisper, or sherpa models.

To use the models, the package requires that we include the encoder, decoder, and joiner. However, the model the AI engineer created has the joiner embedded in it rather than exported as a separate file.

What we’re concerned about now is how to consume this model without having the app continuously crashing.

Check the attached image to see how other sherpa models are meant to be used by default; the joiners are separate files by default.

1 Like

To use the models, the package requires we include the encoder, decoder, and joiner.

That is not true. You only need to pass a joiner if you are using a transducer model.

I hope you know what a transducer model is.

As we can see from the screenshot you posted, at line 6, it returns
Future<sherpa_onnx.OnlineModelConfig>

The definition of sherpa_onnx.OnlineModelConfig can be found at

If you use a streaming paraformer or a streaming zipformer-CTC model, then you don’t need a joiner at all.
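In other words, which fields of sherpa_onnx.OnlineModelConfig you fill in depends on the model type, and only the transducer branch takes a joiner. A sketch of the three cases; the field names follow the sherpa-onnx Dart examples and all paths are placeholders, so verify against your installed package:

```dart
import 'package:sherpa_onnx/sherpa_onnx.dart' as sherpa_onnx;

// Transducer (RNN-T): the only case that requires a separate joiner file.
final transducerConfig = sherpa_onnx.OnlineModelConfig(
  transducer: sherpa_onnx.OnlineTransducerModelConfig(
    encoder: 'encoder.onnx',
    decoder: 'decoder.onnx',
    joiner: 'joiner.onnx',
  ),
  tokens: 'tokens.txt',
);

// Streaming paraformer: encoder + decoder only, no joiner field at all.
final paraformerConfig = sherpa_onnx.OnlineModelConfig(
  paraformer: sherpa_onnx.OnlineParaformerModelConfig(
    encoder: 'encoder.onnx',
    decoder: 'decoder.onnx',
  ),
  tokens: 'tokens.txt',
);

// Streaming zipformer-CTC: a single model file, no joiner either.
final ctcConfig = sherpa_onnx.OnlineModelConfig(
  zipformer2Ctc: sherpa_onnx.OnlineZipformer2CtcModelConfig(
    model: 'model.onnx',
  ),
  tokens: 'tokens.txt',
);
```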


We had an AI engineer create a custom streaming multi-language transcription model that supports German and English

Is your model a streaming whisper model?
If yes, then the current sherpa-onnx does not support it.

As you said, it is a customized streaming model; I don’t think you will find existing support for a customized streaming model anywhere.

If your streaming model is open-sourced and if you provide an ONNX exported model, then we can support it in sherpa-onnx and also provide Dart/Flutter examples for it. Otherwise, you need to change sherpa-onnx to support it by yourself.

1 Like