This week we’re kicking off the first session of the ML for Audio Study Group! The first three sessions will be an overview of audio, ASR and TTS. There will be some presentations at the beginning related to the suggested resources and time to answer questions at the end.
Topic: Kickoff + Overview of Audio related use cases
Are there any works that directly perform modeling on the raw audio signal itself (e.g. predicting sentiment or emotion) without using text as an intermediate medium? How common is that compared to doing ASR and then running regular NLP models on the text?
I have been wanting to build a side project for counting the filler words ("uhm", "basically", etc.) I use when I am speaking. What sort of audio task does this map onto, and what should I learn to be able to build this?
I’d like to join @amitness’s first question and add some context. A significant portion of information is lost when we transform speech to text: tempo, timbre, volume, pitch, clarity of pronunciation, logical stress. It would seem that all of these aspects could be helpful on downstream tasks. Arguably, for determining the emotion being expressed, pitch, volume and tempo might be just as useful as the content itself (or even more useful!). Meanwhile, logical stress may change the meaning of an utterance significantly; for example, consider italicizing “may” instead of “meaning” in the previous sentence. I think it makes sense to consider using non-verbal information for automatic speech understanding and generation. Is that a thing currently?
I also have a question related to the suggested article. How are the time step intervals determined? Are these intervals constant? Different phonemes have different durations (which also depend on a speaker’s individual characteristics). Moreover, in some languages the difference between long and short vowels, for example, is grammatically significant, so it seems that a shorter window wouldn’t capture that difference. Is that accounted for somehow?
Counting filler words, or just removing filler words from a speech signal, is an interesting problem and also a very important speech enhancement use case.
A naive way of doing this would be to flag disfluencies in the speech: anywhere you see abrupt patterns or breaks in the overall text, flag them in the dataset, and then train a classifier to look at 10 ms segments and classify each one as “filler word” or not (see the sketch below).
You can find a similar approach in this paper: https://arxiv.org/pdf/1812.03415.pdf
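To make that concrete, here is a minimal sketch of the frame-level idea (assuming librosa and scikit-learn are available; the file path and labels are placeholders, and real labels would come from an annotated dataset):

```python
# Hypothetical sketch of a frame-level "filler word vs. not" classifier.
# File path and frame labels are placeholders; real labels come from annotations.
import librosa
import numpy as np
from sklearn.linear_model import LogisticRegression

def frame_features(path, frame_ms=10, sr=16000):
    """Load audio and compute MFCC features with a ~10 ms hop."""
    y, _ = librosa.load(path, sr=sr)
    hop = int(sr * frame_ms / 1000)            # 10 ms -> 160 samples at 16 kHz
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    return mfcc.T                              # shape: (num_frames, 13)

X = frame_features("recording.wav")
y = np.random.randint(0, 2, size=len(X))       # placeholder frame labels
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(f"{clf.predict(X).sum()} frames flagged as filler")
```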
@adorkin (answered at 51:00)
I think my response to @amitness’s first question should be sufficient to answer yours, but do let me know if you have any follow-up or clarification questions.
For your second question, you are absolutely correct: specifically for English we work with 10 ms windows, and this may change for other languages. We also employ an algorithm and loss function called CTC (Connectionist Temporal Classification).
“The intuition of CTC is to output a single character for every frame of the input, so that the output is the same length as the input, and then to apply a collapsing function that combines sequences of identical letters, resulting in a shorter sequence.” - SLP CH 26 (26.4)
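As a toy illustration of that collapsing rule (not the full CTC loss), here is a small sketch, with “_” as an assumed blank symbol:

```python
# Toy illustration of the CTC collapsing rule: merge repeated characters,
# then drop the blank symbol ("_" here is an assumed placeholder).
from itertools import groupby

def ctc_collapse(frame_labels, blank="_"):
    deduped = [ch for ch, _ in groupby(frame_labels)]    # merge repeats
    return "".join(ch for ch in deduped if ch != blank)  # remove blanks

print(ctc_collapse("hh_e_ll_lloo__"))  # -> "hello"
```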
We’ll be covering this a bit next Tuesday too.
Regarding colabs: you can find some ready-to-use colab notebooks at https://speechbrain.github.io/. Most of them are well commented and provide a good overview.
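For example, running one of the pretrained models behind those colabs looks roughly like this (a sketch assuming the speechbrain package and its pretrained EncoderDecoderASR interface; the model id and audio path are just examples):

```python
# Rough sketch of using a SpeechBrain pretrained ASR model; the model id and
# audio file are illustrative, not prescribed by the study group.
from speechbrain.pretrained import EncoderDecoderASR

asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)
print(asr.transcribe_file("example.wav"))
```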
In relation to @amitness’s first question: in terms of performance (both results and inference time), is it more efficient to proceed in two steps (ASR → transcription → task processed via an NLP model) or in one step (audio → task, e.g. sentiment analysis, NER, etc.)?
Speech, in general, is far more feature-rich than text (which is just a 1D representation of the information). This also means that it requires a rather large model to process.
In my experience so far, wherever inference has been the most important criterion, you’re much better off converting to text and working with that; a lot of voice assistants use that approach for intent classification, at least.
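For reference, the two-step route looks roughly like this (a sketch assuming Hugging Face transformers pipelines; the model choice and file name are illustrative):

```python
# Sketch of the two-step route: ASR -> text -> text-based task.
# The model id and audio file are illustrative placeholders.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
sentiment = pipeline("sentiment-analysis")

text = asr("example.wav")["text"]   # step 1: speech to text
print(sentiment(text))              # step 2: text to sentiment
```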