ML for Audio Study Group - pyctcdecode (Jan 18)

Welcome to the fourth session of ML for Audio Study Group! :loud_sound: :loud_sound:

We have a very special webinar organized for you! The Kensho team will join us to give a presentation on pyctcdecode.

Topic: pyctcdecode: A simple and fast speech-to-text prediction decoding algorithm

Speakers

  • Raymond Grossman (LinkedIn: www.linkedin.com/in/raymond-grossman-bb4664114)
    Raymond works as a machine learning engineer at Kensho Technologies, specializing in speech and natural language domains. Prior to coming to Kensho, he studied mathematics at Princeton and was an avid Kaggler under the moniker @ToTrainThemIsMyCause.
  • Jeremy Lopez (LinkedIn: https://www.linkedin.com/in/jeremy-lopez-9107b613a)
    Jeremy is a machine learning engineer at Kensho Technologies and has worked on a variety of different topics including search and speech recognition. Before working at Kensho, he earned a PhD in experimental particle physics at MIT and continued doing physics research as a postdoc at the University of Colorado Boulder.

Suggested Resources (If you want to jump ahead)

How to join

You can post all your questions in this topic! They will be answered during the session.


I was wondering how the hotword boosting is implemented. Is it simply changing the probabilities of the language model by a factor of x or is it something fancy?


What is the future roadmap for pyctcdecode?

  • How many hotwords can we add to the LM?
  • How long does the build take?
  • Can pyctcdecode handle foreign languages such as Korean?

Hi! I am also interested in this topic - how is it implemented and what is the maths behind it? Please feel free to really dig deep into the details :slight_smile: Unfortunately I won't be able to join this afternoon, but will watch the stream afterwards!

Furthermore, I stumbled upon this discussion on your GitHub: Difficulty seeing meaningful changes with hotword boosting · Issue #18 · kensho-technologies/pyctcdecode · GitHub

A user is having issues where he's not seeing meaningful differences when using hotwords, even when upweighting the words to a very large number (like 9999999.0). I tried this myself and had the same experience. Can you please elaborate on this issue and on whether you have made any attempts to make it easier for users to fine-tune their LMs for this specific purpose?

Hi. Thanks a lot for organizing this study group.
1- Could you explain the different decoding approaches, such as Viterbi, WFST, and beam search? What are the differences?
Please compare them in terms of accuracy and efficiency, too.

2- How should we choose the beam size for beam search? What is the best value or range for beam size, especially when comparing different methods in a research paper?

3- Is beam size related to the acoustic model? Is it true that some models need a larger beam size to generate reasonable text sequences?

4- How should we choose the number of subword pieces in BPE for decoding?
Thanks again.
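(To make question 1 concrete: greedy/Viterbi-style best-path decoding corresponds to keeping a single hypothesis, while beam search keeps the top-k prefixes per frame, trading compute for accuracy. Below is a minimal, self-contained CTC prefix beam search in pure Python, a sketch for illustration only, not pyctcdecode's implementation, with no language model and all function names invented for the example:)

```python
import math
from collections import defaultdict

NEG_INF = float("-inf")

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_prefix_beam_search(log_probs, labels, beam_width=8, blank=0):
    """Toy CTC prefix beam search without a language model.

    log_probs: T x V nested lists of per-frame log-probabilities.
    labels: sequence mapping label ids to characters; labels[blank] is blank.
    beam_width=1 approximates greedy decoding; larger beams keep more
    hypotheses alive at higher cost.
    """
    # Each prefix keeps two scores: ending in blank (p_b) and non-blank (p_nb).
    beams = {(): (0.0, NEG_INF)}  # empty prefix, reached via blanks only
    for frame in log_probs:
        next_beams = defaultdict(lambda: (NEG_INF, NEG_INF))
        for prefix, (p_b, p_nb) in beams.items():
            total = logsumexp(p_b, p_nb)
            for c, lp in enumerate(frame):
                if c == blank:
                    # Emitting blank keeps the prefix unchanged.
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (logsumexp(nb_b, total + lp), nb_nb)
                elif prefix and prefix[-1] == c:
                    # Repeated symbol: staying on it extends p_nb only;
                    # a genuinely new occurrence requires a blank in between.
                    nb_b, nb_nb = next_beams[prefix]
                    next_beams[prefix] = (nb_b, logsumexp(nb_nb, p_nb + lp))
                    ext = prefix + (c,)
                    e_b, e_nb = next_beams[ext]
                    next_beams[ext] = (e_b, logsumexp(e_nb, p_b + lp))
                else:
                    ext = prefix + (c,)
                    e_b, e_nb = next_beams[ext]
                    next_beams[ext] = (e_b, logsumexp(e_nb, total + lp))
        # Prune to the top-scoring prefixes (the "beam").
        beams = dict(sorted(next_beams.items(),
                            key=lambda kv: logsumexp(*kv[1]),
                            reverse=True)[:beam_width])
    best = max(beams, key=lambda p: logsumexp(*beams[p]))
    return "".join(labels[i] for i in best)
```

For example, with labels `"_ab"` (blank first) and two frames peaked on `a` then `b`, the decoder returns `"ab"`; shrinking `beam_width` shows how aggressive pruning can drop the eventual winner.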

During the training of an STT system, let's say Wav2Vec 2.0, do we include the language model (i.e., do CTC decoding with an LM) during training?

Can we use it with RNN-based and attention-based models to generate text?

Is it possible to use acoustic models with phoneme output with pyctcdecode and add a lexicon?
