Audio classifier in TFLite format

I’m building a model to run on an Android phone that recognises a set of specific audio commands. There are 6 commands, so I need a classifier with 7 classes: one for each command, plus one for anything unrecognised.
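To make the class layout concrete, here is a minimal sketch of the label mapping. The command names are hypothetical placeholders (the real ones aren't listed in this post); only the count of 6 commands plus one catch-all class comes from the description above.

```python
# Hypothetical command names, used purely for illustration.
COMMANDS = ["start", "stop", "left", "right", "up", "down"]

# The catch-all class goes last, giving 6 + 1 = 7 classes in total.
LABELS = COMMANDS + ["unrecognised"]

# Mappings in the form Hugging Face model configs usually expect.
label2id = {name: i for i, name in enumerate(LABELS)}
id2label = {i: name for name, i in label2id.items()}

num_labels = len(LABELS)
print(num_labels, label2id["unrecognised"])
```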

To do this, I first took facebook/wav2vec2-base and fine-tuned it on a dataset with 1000 examples for each command class and a further 2000 “unrecognised” examples. The classifier performed excellently.
TFLite seemed the best way of getting the model onto Android, so I used optimum to export it to TFLite format (optimum-cli export tflite --task audio-classification ...). This wasn’t easy: at first it failed with the error KeyError: "wav2vec2 (tf_wav2_vec2_for_sequence_classification) is not supported yet with the tflite backend. Only ['onnx'] are supported. If you want to support tflite please propose a PR or open up an issue.".
Eventually I exported it to ONNX, then to TensorFlow, and from there to TFLite. The model was too big (~500MB), so I used dynamic quantisation to get it down to ~100MB. Integer quantisation degraded the model badly, so I left it at dynamic quantisation.
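For reference, dynamic-range quantisation with the TFLite converter typically looks like the sketch below. It assumes the TensorFlow model has been written out as a SavedModel directory; the function names and both paths are hypothetical, and this is not necessarily the exact script used here. The small helper shows why a ~500MB float32 model lands in the region of ~100MB: int8 weights take roughly a quarter of the space.

```python
from pathlib import Path


def quantise_dynamic(saved_model_dir: str, out_path: str) -> int:
    """Convert a TF SavedModel to TFLite with dynamic-range quantisation.

    Returns the number of bytes written to out_path.
    """
    # Imported lazily because TensorFlow is a heavy dependency.
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    # Optimize.DEFAULT without a representative dataset gives dynamic-range
    # quantisation: weights are stored as int8, activations stay in float.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_bytes = converter.convert()
    return Path(out_path).write_bytes(tflite_bytes)


def expected_size_mb(float32_size_mb: float) -> float:
    """Rough estimate only: int8 weights need a quarter of float32 storage."""
    return float32_size_mb / 4


print(expected_size_mb(500))
```

Full integer quantisation would additionally need a representative dataset so the converter can calibrate activation ranges, which is the step where the model degraded in this case.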

I wanted to compare that model with something simpler, but wav2vec2 doesn’t have any “small” or “tiny” variants. Instead I used Whisper, which is intended for ASR (not classification), in its openai/whisper-tiny variant. I loaded it using transformers.WhisperForAudioClassification, since classification is my goal, and fine-tuned it on the same dataset. Despite being a lot smaller (30MB), it outperformed the first model, which was great.
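The loading step can be sketched as follows. The function name is hypothetical; the checkpoint name and the 7 classes come from the description above. WhisperForAudioClassification puts a classification head on top of the Whisper encoder, so the head starts out randomly initialised and needs fine-tuning before it is useful.

```python
def build_whisper_classifier(num_labels=7, checkpoint="openai/whisper-tiny"):
    """Return (feature_extractor, model) ready for fine-tuning."""
    # Imported lazily because transformers is a heavy dependency.
    from transformers import WhisperFeatureExtractor, WhisperForAudioClassification

    # Turns raw 16 kHz audio into the log-mel features Whisper expects.
    feature_extractor = WhisperFeatureExtractor.from_pretrained(checkpoint)
    # num_labels sets the output size of the new classification head.
    model = WhisperForAudioClassification.from_pretrained(
        checkpoint, num_labels=num_labels
    )
    return feature_extractor, model
```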

The problem came when trying to export this model (the fine-tuned Whisper classifier) to TFLite:

  • Exporting directly to TFLite (optimum-cli export tflite --task audio-classification ...) didn’t work because the audio-classification task only recognises Wav2Vec2, so using it for a Whisper model raises ValueError: Unrecognized configuration class <class 'transformers.models.whisper.configuration_whisper.WhisperConfig'> for this kind of AutoModel: TFAutoModelForAudioClassification. Model type should be one of Wav2Vec2Config. Presumed explanation: Whisper is essentially an ASR model, not a classifier.
  • Exporting to ONNX (optimum-cli export onnx --task audio-classification ...) didn’t work either. It raised ValueError: Asked to export a whisper model for the task audio-classification, but the Optimum ONNX exporter only supports the tasks feature-extraction, feature-extraction-with-past, automatic-speech-recognition, automatic-speech-recognition-with-past for whisper. Please use a supported task. Please open an issue at https://github.com/huggingface/optimum/issues if you would like the task audio-classification to be supported in the ONNX export for whisper.
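A route not covered by the two attempts above is to bypass optimum and call torch.onnx.export on the PyTorch model directly. The following is an untested sketch under several assumptions: the function name and both paths are hypothetical, the fine-tuned checkpoint is assumed to be saved in a local directory, and whether the export succeeds may depend on the installed torch and transformers versions.

```python
# Whisper-tiny's encoder expects log-mel features of a fixed size:
# 80 mel bins x 3000 frames (30 s of 16 kHz audio with a hop length of 160).
N_MELS = 80
N_FRAMES = 30 * 16000 // 160


def export_classifier_to_onnx(model_dir: str, onnx_path: str) -> None:
    """Export a fine-tuned WhisperForAudioClassification checkpoint to ONNX."""
    # Imported lazily because torch and transformers are heavy dependencies.
    import torch
    from transformers import WhisperForAudioClassification

    model = WhisperForAudioClassification.from_pretrained(model_dir)
    model.eval()
    # Ask for plain tuples rather than ModelOutput objects during tracing.
    model.config.return_dict = False

    # A dummy batch with the right shape is enough to trace the graph.
    dummy = torch.zeros(1, N_MELS, N_FRAMES)
    torch.onnx.export(
        model,
        (dummy,),
        onnx_path,
        input_names=["input_features"],
        output_names=["logits"],
        dynamic_axes={"input_features": {0: "batch"}, "logits": {0: "batch"}},
        opset_version=17,
    )


print(N_MELS, N_FRAMES)
```

If this produces a valid ONNX file, the same ONNX to TensorFlow to TFLite path used for the wav2vec2 model might then apply, though that is an assumption rather than something verified here.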

Any suggestions for how I can get this model working on Android? I’ve included the background and what I’ve tried so far in case anyone can suggest a better way of achieving my goal. Any ideas greatly appreciated!