ML for Audio Study Group - Kick Off (Dec 14)

This week we’re kicking off the first session of the ML for Audio Study Group! The first three sessions will be an overview of audio, ASR and TTS. There will be some presentations at the beginning related to the suggested resources and time to answer questions at the end.

Topic: Kickoff + Overview of Audio related use cases

Suggested Resources (To be read before)

How to join

You can post all your questions in this topic! They will be answered during the session


Simple question to kick things off. Apart from ASR and TTS, what other kinds of tasks can be solved in the audio domain?


Here are some tasks related to the audio domain:

  • speaker diarization: which speaker spoke when?
  • speaker recognition: who spoke?
  • sentiment analysis: how does the speaker feel?

1. Are there any works that perform modeling directly on the raw audio signal itself (e.g., for sentiment or emotion) without using text as an intermediate medium? How common is that compared to doing ASR and then running regular NLP models on the text?

2. I have been wanting to build a side project for counting the filler words (“uhm”, “basically”, etc.) I use when I am speaking. What sort of audio task does this map to, and what should I learn to be able to build it?


No, we’ll answer the questions during the session being livestreamed tomorrow. The YouTube link is at the top.


Just out of curiosity: is it possible to separate the vocals and the music in a song recording?


I’d like to join @amitness’s first question and add some context. A significant portion of information is lost when we transform speech to text: tempo, timbre, volume, pitch, clarity of pronunciation, logical stress. All of these aspects would seem helpful for downstream tasks. Arguably, for determining the emotion being expressed, pitch, volume, and tempo might be just as useful as the content itself (or even more useful!). Meanwhile, logical stress may change the meaning of an utterance significantly; for example, consider italicizing may instead of meaning in my previous sentence. I think it makes sense to consider using non-verbal information for automatic speech understanding and generation. Is that a thing currently?


Great job, Omar, great initiative!


I also have a question related to the suggested article. How are the time step intervals determined? Are these intervals constant? Different phonemes have different durations (which also depend on a speaker’s individual characteristics). Moreover, in some languages the difference between long and short vowels, for example, is grammatically significant, so it seems that a short fixed window wouldn’t capture that difference. Is that accounted for somehow?


Is there a Colab available where we can try speech recognition at a beginner/intermediate level? Sorry if it has already been shared.


Hey y’all!

It was nice catching up with you yesterday during the live stream.

Below are the links to all the papers I referenced for your questions:

@amitness (answered at 41:33 and 44:10)

  1. Nowadays, it is common practice to run experiments directly on the speech signal itself. Those signals have much more robust features that the model can learn from.
    A good example of this is explained in the Translatotron article: Google AI Blog: Introducing Translatotron: An End-to-End Speech-to-Speech Translation Model
  2. Counting filler words, or removing them from a speech signal, is an interesting problem and an important speech enhancement use case as well.
    A naive way of doing this would be to flag disfluencies in the speech: anywhere you see abrupt patterns or breaks in the overall flow, flag them in the dataset, and then train a classifier that looks at 10 ms windows and classifies each one as “filler word” or not.
    You can find a similar approach in this paper:
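The windowing step of that naive approach can be sketched with NumPy. This is a toy sketch, assuming a mono float signal; the function names and the energy threshold are my own, not from the paper — a real system would feed these candidate frames into a trained classifier:

```python
import numpy as np

def frame_energies(signal, sample_rate, frame_ms=10):
    """Split a mono signal into fixed-length frames and return per-frame energy."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return (frames ** 2).mean(axis=1)

def flag_low_energy_frames(signal, sample_rate, frame_ms=10, threshold=1e-4):
    """Mark frames whose energy drops below a threshold: crude candidates
    for breaks/disfluencies to hand to a downstream classifier."""
    return frame_energies(signal, sample_rate, frame_ms) < threshold
```

Energy alone only finds silences, of course; distinguishing an “uhm” from a legitimate pause is exactly what the learned classifier would have to do.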

@ysharma (answered at 49:06)
It is very much possible: different instruments have different energy distributions, and these can be separated by PCA/SVD.
I found this paper: RPCA-based real-time speech and music separation method - ScienceDirect, which attempts to do that with a modified PCA routine.
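As a toy illustration of the intuition behind RPCA — the repetitive accompaniment is approximately low-rank in a magnitude spectrogram, while the vocals are sparse — here is a truncated-SVD sketch. This is not the modified routine from the paper; the function name and the `rank` parameter are assumptions of mine:

```python
import numpy as np

def lowrank_sparse_split(spectrogram, rank=1):
    """Toy stand-in for RPCA: the truncated SVD gives the low-rank part
    (repetitive accompaniment); the residual holds the sparse part (vocals)."""
    U, s, Vt = np.linalg.svd(spectrogram, full_matrices=False)
    low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return low_rank, spectrogram - low_rank
```

Real RPCA solves a convex optimization instead of a hard rank cut, but the decomposition target (low-rank plus sparse) is the same.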

@AlekseyDorkin (answered at 51:00)
I think my response to @amitness’s question #1 should be sufficient to answer yours, but do let me know if you have any follow-up or clarification questions.

For your second question, you are absolutely correct: specifically for English we work with 10 ms windows, and this may change for other languages. We also employ an algorithm and loss function called CTC (Connectionist Temporal Classification).

“The intuition of CTC is to output a single character for every frame of the input, so that the output is the same length as the input, and then to apply a collapsing function that combines sequences of identical letters, resulting in a shorter sequence.” - SLP CH 26 (26.4)
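The collapsing function from that quote can be sketched in a few lines. This assumes each input frame has already been reduced to a single symbol (e.g., via an argmax over the per-frame output distribution), and uses `_` as the CTC blank symbol:

```python
from itertools import groupby

def ctc_collapse(frame_outputs, blank="_"):
    """CTC-style collapse: merge runs of identical symbols, then drop blanks."""
    merged = (symbol for symbol, _ in groupby(frame_outputs))
    return "".join(s for s in merged if s != blank)
```

Note that the blank is what allows genuinely doubled letters to survive: `"aa_a"` collapses to `"aa"`, whereas without the blank the two runs of `a` would merge into one.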

We’ll be covering this a bit next Tuesday too.

Regarding Colabs: there are some ready-to-use ones available; most of them are well commented and provide a good overview.



In relation to question 1 of @amitness: in terms of performance (both results and inference time), is it more efficient to proceed in two steps (ASR → transcription → task processed via an NLP model) or in one step (audio → task, e.g. sentiment analysis, NER, etc.)?


Speech, in general, is far more feature-rich than text (which is just a 1D representation of information). This also means that it requires a rather large model to process.
In my experience so far, wherever inference time has been the most important criterion, you’re much better off converting to text and working with that; a lot of voice assistants use this approach, at least for intent classification.