It was nice catching up with you yesterday during the live stream.
Below are the links to all the papers I referenced for your questions:
@amitness (answered at 41:33 and 44:10)
- Nowadays, it is almost common practice to run your experiments directly on the speech signal itself. Those signals have much more robust features that the model can learn from.
A good example of this is explained in the translatotron article: Google AI Blog: Introducing Translatotron: An End-to-End Speech-to-Speech Translation Model
- Counting filler words or just removing filler words from a speech signal is an interesting problem and is also a very important speech enhancement use case as well.
A naive way of doing this would be to just flag disfluencies in the speech: anywhere you see abrupt patterns or breaks in the overall flow, mark them in the dataset, then train a classifier that looks at 10 ms windows and classifies each one as “filler word” or not.
You can find a similar approach in this paper: https://arxiv.org/pdf/1812.03415.pdf
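To make the framing step concrete, here is a minimal sketch of that naive approach: split the waveform into 10 ms windows and flag low-energy windows (abrupt breaks) as candidate disfluency regions. The function names, the 16 kHz sample rate, and the energy threshold are all my own assumptions for illustration; a real system would train a classifier on labelled windows rather than threshold energy.

```python
import numpy as np

def frame_signal(signal, sr=16000, frame_ms=10):
    # Split a 1-D waveform into non-overlapping 10 ms frames.
    hop = int(sr * frame_ms / 1000)
    n_frames = len(signal) // hop
    return signal[: n_frames * hop].reshape(n_frames, hop)

def flag_break_frames(signal, sr=16000, energy_thresh=0.01):
    # Naive stand-in for a trained classifier: flag frames whose
    # mean energy drops below a threshold as candidate breaks.
    frames = frame_signal(signal, sr)
    energy = (frames ** 2).mean(axis=1)
    return energy < energy_thresh
```

The flagged frames would then become the positive examples for training the actual “filler word or not” classifier.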
@ysharma (answered at 49:06)
It is very much possible: different instruments have different energies, and these can be separated fairly easily by PCA/SVD.
I found this paper: RPCA-based real-time speech and music separation method - ScienceDirect, which attempts to do that with a modified PCA routine.
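As a rough sketch of the idea behind that paper: RPCA splits a magnitude spectrogram into a low-rank part (the repetitive music accompaniment) and a sparse part (the varying speech/vocals). This toy version is my own simplification, using a truncated SVD instead of a proper RPCA solver, just to show the decomposition structure:

```python
import numpy as np

def lowrank_sparse_split(spectrogram, rank=2):
    # Crude stand-in for RPCA: approximate the repetitive
    # accompaniment with a rank-k SVD reconstruction and treat
    # the residual as the sparse (speech) component.
    U, s, Vt = np.linalg.svd(spectrogram, full_matrices=False)
    low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    sparse = spectrogram - low_rank
    return low_rank, sparse
```

A real RPCA routine solves a convex relaxation (nuclear norm plus L1 penalty) rather than hard-truncating singular values, but the low-rank-plus-sparse split is the same.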
@AlekseyDorkin (answered at 51:00)
I think my response to @amitness #1 should be sufficient to answer your question, but do let me know if you have any follow-up or clarification questions.
For your second question, you are absolutely correct: specifically for English we work with 10 ms windows, and this may change for other languages. We also employ another algorithm and loss function called CTC (Connectionist Temporal Classification).
“The intuition of CTC is to output a single character for every frame of the input, so that the output is the same length as the input, and then to apply a collapsing function that combines sequences of identical letters, resulting in a shorter sequence.” - SLP CH 26 (26.4)
We’ll be covering this a bit next Tuesday too.
Regarding colabs - you can find some ready-to-use colabs at https://speechbrain.github.io/ most of them are well commented and provide a good overview.