Speech-to-text on constrained hardware (embedded)

Good day to you all.

I’ve been interested in implementing a speech-to-text model on a low-power microcontroller with an embedded AI accelerator.

I started out by trying to replicate the Jasper model as described in the research paper here: https://arxiv.org/pdf/1904.03288

I knew the hardware was limited, and now I realize just how severe the limitations are. These all apply to 1-D convolutional layers (conv1d):

  • Kernel size can only be 1–9
  • Stride is fixed to 1
  • Padding can only be 0, 1, 2, or 3, and must be 0 when using more than 64 input channels
  • Dilation can be 1 to 1023 for kernel sizes 1, 2, or 3, and is fixed to 1 for kernel sizes greater than 3
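For my own sanity I wrote the constraints down as a quick layer checker (this is just my reading of the list above; `conv1d_params_ok` is a helper name I made up, not part of any vendor SDK):

```python
def conv1d_params_ok(in_channels: int, kernel_size: int, padding: int,
                     stride: int = 1, dilation: int = 1) -> bool:
    """Return True if a conv1d config fits the accelerator limits listed above."""
    if not 1 <= kernel_size <= 9:          # kernel size must be 1-9
        return False
    if stride != 1:                        # stride is fixed to 1
        return False
    if padding not in (0, 1, 2, 3):        # padding limited to 0-3
        return False
    if padding > 0 and in_channels > 64:   # padding forced to 0 above 64 in-channels
        return False
    if kernel_size <= 3:                   # dilation 1-1023 only for kernels 1-3
        return 1 <= dilation <= 1023
    return dilation == 1                   # otherwise dilation fixed to 1


# A few spot checks against the rules:
print(conv1d_params_ok(32, 9, 3))                 # fine: small channel count
print(conv1d_params_ok(128, 3, 1))                # rejected: padding with >64 channels
print(conv1d_params_ok(128, 3, 0, dilation=8))    # fine: dilated k=3, zero padding
print(conv1d_params_ok(64, 5, 0, dilation=2))     # rejected: dilation on k>3
```

Running every layer of a candidate architecture through a check like this before training would at least tell me up front whether the model can map onto the hardware at all.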

Needless to say, Jasper might be a bit of a pipe dream. Is there another approach that might work within these limitations? I don’t need a state-of-the-art low-WER model, but I don’t want it to be completely useless either.

Thanks