I have a task to detect a word in audio/video recordings (not real-time). It is a common English word, but the recordings can be in any language. So far we have tried transcribing the files, but in some languages the word can be mistranscribed beyond recognition. So the idea is to try wake word detection.
I would be grateful for any recommendations. (We can live with false positives.)
There is a guide to the audio classification task.
You will need a dataset with your wake word in it. Do not try to capture the word directly in every language: that is noisy, and the word can appear in different contextual positions depending on the language. I would suggest trying two different paths. A diffusion model could work well, since noise can be reduced while the overall context is preserved, so your wake word would survive in its contextual representations despite the noise. The difficult part is whether the diffusion process can differentiate real background noise from the noise applied during training (see the sketch below).
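To make the "applied noise" caveat concrete, here is a minimal sketch (my addition, not from the original suggestion) of the standard DDPM noise-prediction objective in PyTorch. The log-mel spectrogram shape and the `denoiser(x_t, t)` network are assumptions; the point is that the model is only ever trained to remove the Gaussian noise added in this loop, which is exactly why telling it apart from real-world background noise is the hard part.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # standard linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative signal-retention factor

def diffusion_loss(denoiser, x0):
    """x0: (batch, 1, mel_bins, frames) clean log-mel spectrograms (assumed shape)."""
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    noise = torch.randn_like(x0)
    a = alpha_bar.to(x0.device)[t].view(-1, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise   # forward process q(x_t | x_0)
    pred = denoiser(x_t, t)                        # hypothetical U-Net predicting the added noise
    return F.mse_loss(pred, noise)                 # the model learns to undo *this* noise only
```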
It would probably be better to encode the audio into a latent space and then use a SwiGLU network to learn to ignore the noise and capture the contextual wake word. This can be pretty straightforward: first encode the audio into a latent space, such as a context vector, and then use 2 to 3 sequential layers that cut through the noise to the contextual wake word (a minimal sketch is below). This should be easily achievable.
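A minimal sketch of that second path, assuming PyTorch and a pooled context vector from some pretrained multilingual speech encoder (e.g. a wav2vec2-style model); the 768 dimension and all layer sizes are placeholders, not something the reply specified:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUBlock(nn.Module):
    """One residual SwiGLU feed-forward block (SiLU-gated linear unit)."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.w_gate = nn.Linear(dim, hidden)
        self.w_up = nn.Linear(dim, hidden)
        self.w_down = nn.Linear(hidden, dim)

    def forward(self, x):
        h = self.norm(x)
        h = F.silu(self.w_gate(h)) * self.w_up(h)   # SwiGLU gating
        return x + self.w_down(h)                   # residual connection

class WakeWordHead(nn.Module):
    """2-3 sequential SwiGLU blocks over a pooled audio embedding, then a binary logit."""
    def __init__(self, dim: int = 768, hidden: int = 2048, depth: int = 3):
        super().__init__()
        self.blocks = nn.Sequential(*[SwiGLUBlock(dim, hidden) for _ in range(depth)])
        self.out = nn.Linear(dim, 1)

    def forward(self, latent):                      # latent: (batch, dim) context vector
        return self.out(self.blocks(latent)).squeeze(-1)   # logit: wake word present?

# Usage sketch: mean-pool encoder hidden states per clip, then train with BCE loss on
# positive clips (wake word present) vs. negative clips (any other speech).
head = WakeWordHead()
latent = torch.randn(8, 768)                        # stand-in for real pooled embeddings
loss = F.binary_cross_entropy_with_logits(head(latent), torch.zeros(8))
```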
Thanks! I did the course, of course. The thing is, the word is not part of any of the datasets, and none of the mentioned models was trained to detect it.
Thank you! I had not considered these options and did not even know about them. Any ideas on the required size of the dataset?
The size of the dataset depends on the direction you choose. If you go the diffusion route, you will need a substantial and diverse dataset to achieve generalization. If you use GLU or SwiGLU sequential networks, you can probably get by with a smaller dataset, but quality will matter far more than quantity.
Diffusion: 1 million+
GLU/SwiGLU: around 100k is probably enough
These are rough estimates and are very dependent on quality and diversity.
Variation and quality will be major factors for accuracy, and ensuring diverse, varied data will help with generalization, which is what you want.
Also, you must factor in noise, since real data is not clean. You will need to clean that data as best you can, or ensure that when you convert it to a latent space you tag the wake word with a contextual marker; this can be a POS tag or a fixed embedding of the wake word. So in the case of a GLU, you provide the noisy latent-space vector together with the fixed wake word embedding, and the model learns to differentiate the wake word from the noise. If you pass both into sequential GLUs (at least 2), the first should reduce the noise and the second should target the wake word within the latent vector, and from that it can generalize (see the sketch below). But you will need diversity for generalization across varied language structures and nuances of meaning.
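A minimal sketch of that conditioning idea, again assuming PyTorch; the dimensions, the fixed wake word embedding, and the split of roles between the two GLU layers are illustrative assumptions, not a tested recipe:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GLULayer(nn.Module):
    """A plain gated linear unit: half of the projection gates the other half."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, 2 * out_dim)

    def forward(self, x):
        return F.glu(self.proj(x), dim=-1)

class ConditionedWakeWordDetector(nn.Module):
    """Two sequential GLUs over [noisy clip latent ; fixed wake word embedding]."""
    def __init__(self, latent_dim: int = 768, word_dim: int = 768, hidden: int = 1024):
        super().__init__()
        self.glu1 = GLULayer(latent_dim + word_dim, hidden)   # intended to suppress noise
        self.glu2 = GLULayer(hidden, hidden)                   # intended to isolate the wake word
        self.out = nn.Linear(hidden, 1)

    def forward(self, clip_latent, word_embedding):
        x = torch.cat([clip_latent, word_embedding], dim=-1)   # condition on the target word
        return self.out(self.glu2(self.glu1(x))).squeeze(-1)   # logit: wake word present?

# Usage sketch: word_embedding is one fixed vector (e.g. an averaged embedding of clean
# recordings of the word) broadcast across a batch of noisy clip latents.
detector = ConditionedWakeWordDetector()
clip_latent = torch.randn(8, 768)
word_embedding = torch.randn(768).expand(8, -1)
logits = detector(clip_latent, word_embedding)
```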