Support for ASR inference on longer audio files or for live transcription?

Hi,

I've fine-tuned some ASR models (Whisper and XLSR/wav2vec2), and I now want to use them for inference. I've deployed an API on the cloud (GCP), and it works fine for short audio files (up to 2 to 3 minutes). However, I want to work with either larger files or live transcription. So here are my two questions:

  • Are there functions within Hugging Face, which I might have overlooked, that simplify working with larger audio files? Maybe some way of turning the audio into a stream, or of automatically working with chunks?
  • A similar question for live inference: is there any support within Hugging Face for using these large models for live transcription?
    Thanks :slight_smile:

Looking for a similar solution as well. Anyone? @sanchit-gandhi

Here you go! See the openai/whisper-large-v2 page on Hugging Face and the Google Colab notebook.
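
For the long-file case, here is a minimal sketch of chunked inference with the `transformers` ASR pipeline, which accepts a `chunk_length_s` argument. The function names `transcribe_long` and `chunk_bounds` are my own, not library APIs, and the 30 s chunk with 5 s of overlap on each side is an assumed setting you would tune for your model. `chunk_bounds` is only a simplified illustration of how overlapping windows cover a long file; the pipeline does its own chunking and stitching internally.

```python
from typing import List, Tuple


def chunk_bounds(total_s: float, chunk_s: float = 30.0,
                 stride_s: float = 5.0) -> List[Tuple[float, float]]:
    """Illustrative helper (not a library function): list the (start, end)
    windows, in seconds, that cover an audio file of length total_s.
    Each window is chunk_s long and overlaps its neighbour by 2 * stride_s."""
    if chunk_s <= 2 * stride_s:
        raise ValueError("chunk_s must be larger than 2 * stride_s")
    step = chunk_s - 2 * stride_s  # how far each new window advances
    bounds = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        bounds.append((start, end))
        if end >= total_s:  # the last window reached the end of the audio
            break
        start += step
    return bounds


def transcribe_long(audio_path: str,
                    model_id: str = "openai/whisper-large-v2") -> str:
    """Transcribe a long audio file by letting the pipeline chunk it.
    model_id is an assumption; pass your own fine-tuned checkpoint instead."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # assumed chunk size; the pipeline merges the chunks
    )
    return asr(audio_path)["text"]


if __name__ == "__main__":
    # A 70 s file is covered by three overlapping 30 s windows.
    print(chunk_bounds(70.0))
```

Usage would be something like `transcribe_long("meeting.wav", model_id="your-username/your-finetuned-model")`, where both the file name and the model id are placeholders. For CTC models such as wav2vec2, the pipeline also takes a `stride_length_s` argument to control the overlap between chunks.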