ML for Audio Study Group - Intro to Audio and ASR (Dec 21)

Welcome to the second week of ML for Audio Study Group! :loud_sound: :loud_sound:

There will be some presentations at the beginning related to the suggested resources and time to answer questions at the end.

This week we’ll do a deep dive into the basics: we’ll give a high-level overview of audio data and its challenges. Then, we’ll jump into the details of our first task: Automatic Speech Recognition. Beware, there will be memes!

Topic: Intro to Audio and ASR Deep Dive

Suggested Resources (To be read before)

With the first reading, the goal is to get a high level understanding. Don’t worry if something is not completely clear or you don’t get all details!

How to join

You can post all your questions in this topic! They will be answered during the session


I’m going to start: What are the current SoTA models for ASR?


What are some suitable approaches for dealing with data sparsity, e.g. limited audio data in a specific language? Is increasing the amount of training data via TTS feasible?


What’s the current approach to dealing with different dialects of the same language? Other than diversifying training data, that is.


It is more and more important to be able to run inference on mobile CPUs and GPUs.

What do you think are the best Android projects that implement speech recognition? (Being specific here, because Android is the most used OS worldwide.)

Also, looking into something like DDSP or Timbre Transfer, can we expect apps like voice-to-trumpet to work on Android (rather than through the browser, which is super slow)? How would you go about porting something like DDSP to run on a mobile CPU/GPU?


Why haven’t we moved away from the Mel scale yet? It dates from the 1940s, and the original paper, A Scale for the Measurement of the Psychological Magnitude Pitch, was done with 5 participants! Is this even replicable?

Quickly looking into it now, it seems there are many different Mel curves with slightly different log equations. Without diving too far down the rabbit hole, I even saw a paper claiming to do better without a log. I’m just wondering what your thoughts are here: is the Mel scale you use itself a parameter to tune?


Will you make a tutorial on audio, so I can generate a certain person’s voice, or convert my own voice into a certain person’s voice?


Hey VB/Omar, just curious to know this – is it possible to identify someone’s voice and replicate it to produce further content?


I know the focus is ASR, but are there any suggestions for what models to use for music (classification)? It seems like most models shown are for speech. By music classification I mean tasks like genre classification, raga classification (specifically for Indian classical music), scale classification, and so on.


Hey y’all,

Thanks for joining the stream yesterday! Please find @osanseviero’s and my responses to your questions from the forums and YouTube below:

@lbehringer (answered in 53:15)

What are some suitable approaches for dealing with data sparsity, e.g. limited audio data in a specific language? Is increasing the amount of training data via TTS feasible?

When you don’t have much data, you can do two things.

  • Transfer learning. Similar to NLP or Computer Vision, you can take a pretrained model such as Wav2Vec2 or HuBERT. These models were trained in an unsupervised fashion to gain a statistical understanding of the data on which they were trained. You can then fine-tune the model with your own data, and this tends to give good results even without much data.
  • Data Augmentation. This is a bit more aligned with your proposal about using TTS. Conventional data augmentation techniques in audio are adding background noise, deforming the waveform, overlaying multiple clips, etc. (see the sketch after this list); there are other methods such as SpecAugment. Generating synthetic datasets with TTS is not a well-explored field, but I found a recent paper from Amazon that could be interesting to you: SynthASR.
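
As a rough illustration of the conventional waveform-level augmentations mentioned above, here is a minimal Python sketch (assuming librosa and NumPy; the file name and parameter values are placeholders):

```python
# Hedged sketch: augmenting a clip by adding background noise and time-stretching.
import numpy as np
import librosa

def augment(waveform, noise_level=0.005, stretch_rate=0.9):
    """Return a noisy, slightly time-stretched copy of a mono waveform."""
    noisy = waveform + noise_level * np.random.randn(len(waveform))
    return librosa.effects.time_stretch(noisy, rate=stretch_rate)

wav, sr = librosa.load("sample.wav", sr=16000)  # hypothetical clip
augmented = augment(wav)
```

In practice you would generate several augmented copies per clip and mix them into the training set alongside the originals.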

@adorkin (answered in 55:01)

What’s the current approach to dealing with different dialects of the same language? Other than diversifying training data, that is.

One of the more industry-relevant approaches is Connectionist Temporal Classification (CTC); you can read more about it in SLP chapter 26.4 (https://web.stanford.edu/~jurafsky/slp3/26.pdf).

More recently, end-to-end models like Wav2Vec2 or Conformer handle this either through large-scale pre-training data or by adding both local and global context into the hidden states, which makes the models more versatile. That said, no architecture is perfect.
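
For reference, here is a minimal, hedged sketch of how a CTC loss is computed during training, using PyTorch’s built-in nn.CTCLoss; the shapes and vocabulary size are made up for illustration:

```python
# Hedged sketch: computing a CTC loss over dummy acoustic-model outputs.
# Index 0 is reserved for the CTC blank symbol.
import torch
import torch.nn as nn

T, N, C, S = 50, 4, 32, 20     # timesteps, batch size, vocab size, target length
log_probs = torch.randn(T, N, C).log_softmax(dim=2)        # per-frame log-probabilities
targets = torch.randint(1, C, (N, S), dtype=torch.long)    # character indices (no blanks)
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
print(loss.item())
```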

@geevegeorge (answered in 57:00)

What do you think are the best Android projects that implement speech recognition? (Being specific here, because Android is the most used OS worldwide.)
Also, looking into something like DDSP or Timbre Transfer, can we expect apps like voice-to-trumpet to work on Android (rather than through the browser, which is super slow)? How would you go about porting something like DDSP to run on a mobile CPU/GPU?

At the end of the day, DDSP is just an autoencoder, and you can deal with it the same way as any deep learning model: either expose it via an API or convert it to an embedded model (for Android, using TFLite) and run inference on it.
Such apps would definitely work; you can also look at model quantization techniques to make the model faster through a small tradeoff in performance.
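
For the embedded route, a minimal sketch of converting a trained Keras model into a quantized TFLite model might look like the following (the saved-model path is a placeholder, and dynamic-range post-training quantization is just one of several options):

```python
# Hedged sketch: exporting a Keras model to TFLite with post-training quantization.
import tensorflow as tf

model = tf.keras.models.load_model("my_model")            # hypothetical trained model
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]       # enable quantization
tflite_model = converter.convert()

with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```

The resulting .tflite file can then be bundled with an Android app and run with the TFLite interpreter.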

@CupOfGeo (answered in 59:08)

Why haven’t we moved away from the Mel scale yet? It dates from the 1940s, and the original paper, A Scale for the Measurement of the Psychological Magnitude Pitch, was done with 5 participants! Is this even replicable?

From bean on the Discord #ml-4-audio-study-group channel: “It originates from trying to featurize audio in a way that is more relevant to human perception. If you look at how humans hear, you can see that human frequency resolution is better at low frequencies - the mel scale has finer frequency resolution at lower mels and combines more frequencies at higher mels.

I suspect that as long as the tasks being performed are still related to human perception, we will still find a benefit in using the mel scale.”
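
To see that resolution pattern concretely, here is a small sketch (assuming librosa; the parameters are arbitrary) that prints the centre frequencies of a mel filterbank, with low bands closely spaced and high bands far apart:

```python
# Hedged sketch: inspecting how a mel filterbank spaces its bands.
import librosa

sr, n_fft, n_mels = 16000, 1024, 40
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (n_mels, 1 + n_fft // 2)
centres = librosa.mel_frequencies(n_mels=n_mels, fmin=0, fmax=sr / 2)
print(centres[:5])    # densely packed low-frequency bands
print(centres[-5:])   # widely spaced high-frequency bands
```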

@JonathanSum (answered in 1:01:51)

Will you make a tutorial on audio, so I can generate a certain person’s voice, or convert my own voice into a certain person’s voice?

Voice style transfer is quite an active area of research in the community. At the end of the day, you can leverage something like GitHub - mazzzystar/randomCNN-voice-transfer: Audio style transfer with shallow random parameters CNN or GitHub - auspicious3000/autovc: AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss to tweak the generated spectrogram and have it mimic the voice patterns of the target speaker.
P.S. There are plenty of other solutions listed in the project repos.

@ysharma (answered in 1:03:05)

Hey VB/Omar, just curious to know this – is it possible to identify someone’s voice and replicate it to produce further content?

The identification task is similar to what I showed in the Colab walkthrough; see my response to the question above for voice conversion.
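
For the identification/verification side, a hedged sketch with a pretrained SpeechBrain speaker model (the model id and file names are assumptions) could look like this:

```python
# Hedged sketch: scoring whether two clips come from the same speaker.
from speechbrain.pretrained import SpeakerRecognition

verifier = SpeakerRecognition.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb", savedir="pretrained_ecapa"
)
score, same_speaker = verifier.verify_files("enrolled.wav", "unknown.wav")
print(float(score), bool(same_speaker))
```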

@nishanthc

I know the focus is ASR, but are there any suggestions for what models to use for music (classification)? It seems like most models shown are for speech. By music classification I mean tasks like genre classification, raga classification (specifically for Indian classical music), scale classification, and so on.

Yes. Tasks like these fall under audio classification. You can use SpeechBrain for acoustic feature extraction from your audio files and then train a classifier on top of those features.
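
A minimal sketch of that recipe (pretrained embeddings plus a light classifier) is below; the SpeechBrain checkpoint used here is a speaker-embedding model standing in as a generic audio encoder, and the file names and labels are made up:

```python
# Hedged sketch: extract fixed-size embeddings with SpeechBrain, then train a
# simple scikit-learn classifier on top of them.
import torchaudio
from speechbrain.pretrained import EncoderClassifier
from sklearn.linear_model import LogisticRegression

encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb", savedir="pretrained_encoder"
)

def embed(path):
    wav, sr = torchaudio.load(path)              # (channels, samples)
    emb = encoder.encode_batch(wav)              # (1, 1, emb_dim)
    return emb.squeeze().detach().numpy()

# hypothetical labelled clips
data = [("clip_rock.wav", "rock"), ("clip_carnatic.wav", "carnatic")]
X = [embed(path) for path, _ in data]
y = [label for _, label in data]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([embed("new_clip.wav")]))
```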

Questions from YouTube

  • What if a single input x1 corresponds to multiple characters? (In the context of CTC)

CTC uses very small timesteps (10 ms or so), so there would be no case (in speech at least) in which a single timestep corresponds to multiple characters.
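
A tiny sketch of the standard greedy CTC decoding rule (collapse repeats, then drop blanks) may help make the timestep-to-character relationship concrete; the vocabulary is made up:

```python
# Hedged sketch: greedy CTC decoding over per-frame argmax ids.
def ctc_greedy_decode(frame_ids, id_to_char, blank=0):
    out, prev = [], None
    for idx in frame_ids:
        if idx != prev and idx != blank:
            out.append(id_to_char[idx])
        prev = idx
    return "".join(out)

vocab = {0: "<blank>", 1: "h", 2: "e", 3: "l", 4: "o"}
frames = [1, 1, 0, 2, 2, 3, 0, 3, 4, 4]      # one id per ~10 ms frame
print(ctc_greedy_decode(frames, vocab))       # -> "hello"
```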

  • Is there a model for processing realtime audio streams?

CTC-based algorithms are generally very fast and can be used in production; there is also the slightly more sophisticated RNN-Transducer (RNN-T), which provides better accuracy than CTC.
You can read about both in detail in SLP chapter 26.4.4

  • Are there any pretrained models for Indian speech recognition, and for Indian classical music generation?

Yes, for Indian ASR. You can go to https://huggingface.co/models and, on the left under Tasks, select Automatic Speech Recognition, then select the hi language tag. The resulting search lists 6 models, some of which are fine-tuned Wav2Vec2 models.
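
Once you have picked a checkpoint, a hedged sketch of running it with the transformers pipeline looks like this (the model id below is the English base checkpoint as a stand-in; substitute the Hindi model you selected, and the audio file name is a placeholder):

```python
# Hedged sketch: transcribing a clip with a Wav2Vec2 checkpoint from the Hub.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
result = asr("sample.wav")      # 16 kHz mono audio works best for this family
print(result["text"])
```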

  • What are the best practices for segmenting very long pre-recorded speech inputs for something like Wav2Vec2?

Typically you would segment your files and restrict them and their transcriptions to one sentence each. You might want to look at the Hugging Face fine-tuning guide for more details: Fine-Tune Wav2Vec2 for English ASR in Hugging Face with 🤗 Transformers
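
If you need an automatic first pass at segmentation, a hedged sketch using librosa’s silence-based splitting (the threshold and file name are placeholders) could be:

```python
# Hedged sketch: split a long recording on silence before transcription.
import librosa

wav, sr = librosa.load("long_recording.wav", sr=16000)   # hypothetical file
intervals = librosa.effects.split(wav, top_db=30)        # (start, end) sample indices
segments = [wav[start:end] for start, end in intervals]
print(f"{len(segments)} segments; first is {len(segments[0]) / sr:.1f} s long")
```

You would still want to check that the resulting segments line up with sentence-level transcriptions.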


SOTA on LibriSpeech might be useful here: LibriSpeech test-other Benchmark (Speech Recognition) | Papers With Code
