This week we’re kicking off the first session of the ML for Audio Study Group! The first three sessions will be an overview of audio, ASR and TTS. There will be some presentations at the beginning related to the suggested resources and time to answer questions at the end.
Topic: Kickoff + Overview of Audio related use cases
Are there any works that directly perform modeling on the raw audio signal itself (e.g. predicting sentiment or emotion) without using text as an intermediate medium? How common is that compared to doing ASR and then running regular NLP models on the text?
I have been wanting to build a side project for counting the filler words ("uhm", "basically", etc.) I use when I am speaking. What sort of audio task does this map onto, and what should I learn to be able to build this?
I’d like to join @amitness’s first question and add some context. A significant portion of information is lost when we transform speech to text: tempo, timbre, volume, pitch, clarity of pronunciation, logical stress. It would seem that all of these aspects could be helpful on downstream tasks. Arguably, for determining the emotion being expressed, pitch, volume and tempo might be just as useful as the content itself (or even more useful!). Meanwhile, logical stress may change the meaning of an utterance significantly; for example, consider italicizing “may” instead of “meaning” in the previous sentence. I think it makes sense to consider using non-verbal information for automatic speech understanding and generation. Is that a thing currently?
I also have a question related to the suggested article. How are the time step intervals determined? Are these intervals constant? Different phonemes have different durations (which also depend on a speaker’s individual characteristics). Moreover, in some languages the difference between long and short vowels, for example, is grammatically significant, so it seems that a shorter window wouldn’t capture that difference. Is that accounted for somehow?
Counting filler words, or just removing filler words from a speech signal, is an interesting problem and also a very important speech enhancement use case.
A naive way of doing this would be to flag disfluencies in the speech: anywhere you see abrupt patterns or breaks in the overall text, flag them in the dataset, and then train a classifier to look at 10 ms segments and classify each one as “filler word” or not (see the sketch below).
You can find a similar approach in this paper: https://arxiv.org/pdf/1812.03415.pdf
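To make that concrete, here is a minimal sketch of the frame-level idea (assuming librosa and scikit-learn are available; the file path and labels are placeholders, and real labels would come from an annotated dataset):

```python
# Hypothetical sketch of a frame-level "filler word vs. not" classifier.
# File path and frame labels are placeholders; real labels come from annotations.
import librosa
import numpy as np
from sklearn.linear_model import LogisticRegression

def frame_features(path, frame_ms=10, sr=16000):
    """Load audio and compute MFCC features with a ~10 ms hop."""
    y, _ = librosa.load(path, sr=sr)
    hop = int(sr * frame_ms / 1000)            # 10 ms -> 160 samples at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    return mfcc.T                              # shape: (num_frames, 13)

X = frame_features("recording.wav")
y = np.random.randint(0, 2, size=len(X))       # placeholder frame labels
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(f"{clf.predict(X).sum()} frames flagged as filler")
```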
@adorkin (answered at 51:00)
I think my response to @amitness’s first question should be sufficient to answer yours, but do let me know if you have any follow-up or clarification questions.
For your second question, you are absolutely correct: specifically for English we work with 10 ms windows, and this may change for other languages. We also employ an algorithm and loss function called CTC (Connectionist Temporal Classification).
“The intuition of CTC is to output a single character for every frame of the input, so that the output is the same length as the input, and then to apply a collapsing function that combines sequences of identical letters, resulting in a shorter sequence.” - SLP CH 26 (26.4)
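As a toy illustration of that collapsing rule (not the full CTC loss), here is a small sketch, with “_” as an assumed blank symbol:

```python
# Toy illustration of the CTC collapsing rule: merge repeated characters,
# then drop the blank symbol ("_" here is an assumed placeholder).
from itertools import groupby

def ctc_collapse(frame_labels, blank="_"):
    deduped = [ch for ch, _ in groupby(frame_labels)]    # merge repeats
    return "".join(ch for ch in deduped if ch != blank)  # remove blanks

print(ctc_collapse("hh_e_ll_lloo__"))  # -> "hello"
```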
We’ll be covering this a bit next Tuesday too.
Regarding colabs: you can find some ready-to-use colab notebooks at https://speechbrain.github.io/. Most of them are well commented and provide a good overview.
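For example, running one of the pretrained models behind those colabs looks roughly like this (a sketch assuming the speechbrain package and its pretrained EncoderDecoderASR interface; the model id and audio path are just examples):

```python
# Rough sketch of using a SpeechBrain pretrained ASR model; the model id and
# audio file are illustrative, not prescribed by the study group.
from speechbrain.pretrained import EncoderDecoderASR

asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)
print(asr.transcribe_file("example.wav"))
```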
In relation to @amitness’s first question: in terms of performance (both results and inference time), is it more efficient to proceed in two steps (ASR → transcription → task processed via an NLP model) or in one step (audio → task, e.g. sentiment analysis, NER, etc.)?
Speech, in general, is far more feature-rich than text (which is just a 1D representation of the information). This also means that it requires a rather large model to process.
In my experience so far, wherever inference has been the most important criterion, you’re much better off converting to text and working with that; a lot of voice assistants use that approach for intent classification, at least.
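For reference, the two-step route looks roughly like this (a sketch assuming Hugging Face transformers pipelines; the model choice and file name are illustrative):

```python
# Sketch of the two-step route: ASR -> text -> text-based task.
# The model id and audio file are illustrative placeholders.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
sentiment = pipeline("sentiment-analysis")

text = asr("example.wav")["text"]   # step 1: speech to text
print(sentiment(text))              # step 2: text to sentiment
```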