I’m currently integrating a fine-tuned Whisper Small model, wrapped with Sherpa ONNX, into a Flutter-based Android app. The goal is to achieve real-time speech-to-text functionality. I’ve been following the documentation provided here: Sherpa ONNX Flutter Examples.
My fine-tuned model is designed for the German language and includes a specific vocabulary. However, I’ve encountered an issue: the documentation mentions the need for a joiner file, but my research indicates that the Whisper model already includes its joiner within the tensor architecture.
I’m looking for any workarounds or additional documentation that could help me integrate the Sherpa ONNX Whisper model into my app without the need for a separate joiner file.
It’s fantastic that you’re working on integrating a fine-tuned Whisper model with Sherpa ONNX in a Flutter-based Android app! Achieving real-time speech-to-text in a specific language is a challenging yet rewarding task.
In my opinion, there are several ways to solve this problem; the following two approaches should work well for you.
Sherpa ONNX Expectations:
Sherpa ONNX often assumes models follow a certain structure, especially for RNN-T (transducer) architectures, which include an explicit joiner component. Whisper’s transformer-based encoder-decoder architecture doesn’t require an external joiner, since that processing happens inside the model itself.
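To make the difference concrete, here is a minimal sketch using the sherpa_onnx Dart package. This is only a sketch: the field names follow recent versions of the package and may differ in yours, and all file paths are placeholders. The transducer path has an explicit joiner field, while the Whisper path takes just an encoder and a decoder:

```dart
import 'package:sherpa_onnx/sherpa_onnx.dart' as sherpa_onnx;

void main() {
  sherpa_onnx.initBindings();

  // RNN-T / transducer models: sherpa-onnx expects a separate joiner file.
  final transducerConfig = sherpa_onnx.OnlineRecognizerConfig(
    model: sherpa_onnx.OnlineModelConfig(
      transducer: sherpa_onnx.OnlineTransducerModelConfig(
        encoder: 'encoder.onnx',
        decoder: 'decoder.onnx',
        joiner: 'joiner.onnx', // the explicit joiner sherpa-onnx asks for
      ),
      tokens: 'tokens.txt',
    ),
  );

  // Whisper models: there is no joiner field at all; an encoder and a
  // decoder are enough.
  final whisperConfig = sherpa_onnx.OfflineRecognizerConfig(
    model: sherpa_onnx.OfflineModelConfig(
      whisper: sherpa_onnx.OfflineWhisperModelConfig(
        encoder: 'whisper-encoder.onnx',
        decoder: 'whisper-decoder.onnx',
        language: 'de', // assumption: a fine-tuned German model
      ),
      tokens: 'whisper-tokens.txt',
    ),
  );

  print(transducerConfig);
  print(whisperConfig);
}
```

Note that Whisper goes through the offline (non-streaming) recognizer in sherpa-onnx, so if that configuration works for your model, no joiner file ever enters the picture.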
Workaround Ideas:
Customize Sherpa ONNX:
You may need to bypass or adapt the Sherpa ONNX codebase to account for Whisper’s unique architecture. For instance, investigate how Sherpa ONNX expects the joiner to be used and modify those parts to align with Whisper’s outputs.
Simplify Output Matching:
If the joiner logic is primarily for aligning model outputs with a vocabulary, you might manually map Whisper’s decoded outputs to your specific vocabulary.
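For illustration, a crude post-processing pass in plain Dart (no Sherpa dependencies; the function name, the edit-distance approach, and the distance threshold are all hypothetical choices, not anything the package provides) could snap each decoded word onto the closest term in your vocabulary:

```dart
// Levenshtein edit distance between two strings.
int _editDistance(String a, String b) {
  final dp = List.generate(a.length + 1, (i) => List.filled(b.length + 1, 0));
  for (var i = 0; i <= a.length; i++) {
    dp[i][0] = i;
  }
  for (var j = 0; j <= b.length; j++) {
    dp[0][j] = j;
  }
  for (var i = 1; i <= a.length; i++) {
    for (var j = 1; j <= b.length; j++) {
      final cost = a[i - 1] == b[j - 1] ? 0 : 1;
      dp[i][j] = [
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + cost, // substitution
      ].reduce((x, y) => x < y ? x : y);
    }
  }
  return dp[a.length][b.length];
}

/// Replace each decoded word with the closest vocabulary entry, keeping
/// the original word when nothing in the vocabulary is close enough.
String mapToVocabulary(String decoded, Set<String> vocabulary,
    {int maxDistance = 2}) {
  return decoded.split(' ').map((word) {
    var best = word;
    var bestDist = maxDistance + 1;
    for (final entry in vocabulary) {
      final d = _editDistance(word.toLowerCase(), entry.toLowerCase());
      if (d < bestDist) {
        best = entry;
        bestDist = d;
      }
    }
    return bestDist <= maxDistance ? best : word;
  }).join(' ');
}

// Example: mapToVocabulary('fahrgestelnummer prüfen', {'Fahrgestellnummer'})
// would correct the misspelled domain term while leaving other words alone.
```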
Feel free to share more details about the integration process, and I’d be happy to brainstorm further! Best of luck with your app development — it sounds like an amazing project!
Thank you for your input! I really appreciate it. The documentation provided by csukuangfj is already excellent. However, I was wondering if you also have a solution for streaming?
We support two-pass ASR, where in the first pass we use a small, fast, but less accurate streaming model and in the second pass we use a non-streaming but more accurate model.
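As an illustration, a rough sketch of that two-pass flow with the sherpa_onnx Dart API might look like the following. The method names follow recent versions of the package, but the buffering and endpoint handling here are simplified assumptions rather than a canonical implementation, and endpoint detection must also be enabled in the online recognizer config:

```dart
import 'dart:typed_data';
import 'package:sherpa_onnx/sherpa_onnx.dart' as sherpa_onnx;

/// Feed one chunk of 16 kHz mono audio to the first-pass streaming
/// recognizer; once it detects an endpoint, re-decode the buffered
/// utterance with the slower but more accurate second-pass model.
/// Returns the refined text, or null while the utterance is in progress.
String? processChunk({
  required sherpa_onnx.OnlineRecognizer firstPass,
  required sherpa_onnx.OnlineStream firstPassStream,
  required sherpa_onnx.OfflineRecognizer secondPass,
  required List<double> utteranceBuffer,
  required Float32List chunk,
}) {
  firstPassStream.acceptWaveform(samples: chunk, sampleRate: 16000);
  utteranceBuffer.addAll(chunk);

  while (firstPass.isReady(firstPassStream)) {
    firstPass.decode(firstPassStream);
  }
  // Interim hypothesis from the fast model; show this in the UI.
  final interim = firstPass.getResult(firstPassStream).text;
  print('interim: $interim');

  if (!firstPass.isEndpoint(firstPassStream)) {
    return null; // utterance still in progress
  }

  // Second pass: re-decode the complete utterance with the accurate model.
  final offlineStream = secondPass.createStream();
  offlineStream.acceptWaveform(
    samples: Float32List.fromList(utteranceBuffer),
    sampleRate: 16000,
  );
  secondPass.decode(offlineStream);
  final finalText = secondPass.getResult(offlineStream).text;
  offlineStream.free();

  utteranceBuffer.clear();
  firstPass.reset(firstPassStream);
  return finalText;
}
```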
The problem we’re currently experiencing is not related to pre-built models. We had an AI engineer create a custom streaming, multi-language transcription model that supports German and English.
The problem is this: I am currently using the sherpa_onnx package in the Flutter app, which is the only reasonable pub.dev package that supports Kaldi, Whisper, or Sherpa models.
To use the models, the package requires us to include the encoder, decoder, and joiner. However, the model the AI engineer created has the joiner embedded within it rather than exported as a separate file.
What we’re concerned about now is how to consume this model without the app continuously crashing.
Check the attached image to see how other Sherpa models are meant to be used; by default, the joiners are separate files.
If you use a streaming paraformer or a streaming zipformer-CTC model, then you don’t need a joiner at all.
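As a sketch, the joiner-free streaming configurations in the sherpa_onnx Dart package look roughly like this (field names per recent package versions; all paths are placeholders):

```dart
import 'package:sherpa_onnx/sherpa_onnx.dart' as sherpa_onnx;

void main() {
  sherpa_onnx.initBindings();

  // Streaming paraformer: encoder + decoder only, no joiner field.
  final paraformerConfig = sherpa_onnx.OnlineRecognizerConfig(
    model: sherpa_onnx.OnlineModelConfig(
      paraformer: sherpa_onnx.OnlineParaformerModelConfig(
        encoder: 'paraformer-encoder.onnx',
        decoder: 'paraformer-decoder.onnx',
      ),
      tokens: 'tokens.txt',
    ),
  );

  // Streaming zipformer CTC: a single model file, no joiner field.
  final ctcConfig = sherpa_onnx.OnlineRecognizerConfig(
    model: sherpa_onnx.OnlineModelConfig(
      zipformer2Ctc: sherpa_onnx.OnlineZipformer2CtcModelConfig(
        model: 'zipformer2-ctc.onnx',
      ),
      tokens: 'tokens.txt',
    ),
  );

  // Either config is passed to sherpa_onnx.OnlineRecognizer(...) as usual.
  print(paraformerConfig);
  print(ctcConfig);
}
```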
We had an AI engineer create a custom streaming, multi-language transcription model that supports German and English
Is your model a streaming Whisper model?
If yes, then the current sherpa-onnx does not support it.
As you said, it is a customized streaming model; I don’t think you will find existing support for it anywhere.
If your streaming model is open-sourced and you provide an ONNX export of it, then we can support it in sherpa-onnx and also provide Dart/Flutter examples for it. Otherwise, you will need to modify sherpa-onnx to support it yourself.