Finding timestamps of patterns in spectrograms using a machine learning model

Hello,

I am interested in using a model to find pre-defined patterns in a series of spectrograms. I need the timestamps at which the patterns occur, and because the patterns are hard to define programmatically, I believe a machine learning model is the best fit. I have many samples of the patterns for the model to learn from, so training a model from scratch should be possible if necessary.

I have tried a few different approaches, from image segmentation of the spectrograms to audio classification, but can’t get either to produce satisfactory results. For the audio classification models, I’ve tried giving each centisecond of the spectrogram its own label, but the model very quickly “overfits”, in the sense that it ends up producing the same label for all samples.
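To make the per-centisecond labelling concrete, here is a minimal sketch of the idea, not my exact code. It assumes the annotations are (start, end) pairs in seconds and that there is one label per 10 ms frame; the function names and the 10 ms frame size are illustrative assumptions. It also computes the fraction of frames labelled as pattern, since a very small fraction might be one reason a classifier settles on a single label.

```python
from typing import List, Tuple

FRAME_SEC = 0.01  # assumption: one label per centisecond of spectrogram


def intervals_to_frame_labels(
    intervals: List[Tuple[float, float]],
    total_sec: float,
    frame_sec: float = FRAME_SEC,
) -> List[int]:
    """Turn (start, end) annotations in seconds into one 0/1 label per frame."""
    n_frames = round(total_sec / frame_sec)
    labels = [0] * n_frames
    for start, end in intervals:
        first = max(0, round(start / frame_sec))
        last = min(n_frames, round(end / frame_sec))
        for i in range(first, last):
            labels[i] = 1
    return labels


def frame_labels_to_intervals(
    labels: List[int],
    frame_sec: float = FRAME_SEC,
) -> List[Tuple[float, float]]:
    """Merge runs of consecutive 1-frames back into (start, end) timestamps."""
    intervals = []
    run_start = None
    for i, value in enumerate(labels):
        if value and run_start is None:
            run_start = i  # a run of pattern frames begins here
        elif not value and run_start is not None:
            intervals.append((run_start * frame_sec, i * frame_sec))
            run_start = None
    if run_start is not None:  # a run that lasts until the final frame
        intervals.append((run_start * frame_sec, len(labels) * frame_sec))
    return intervals


def positive_fraction(labels: List[int]) -> float:
    """Fraction of frames labelled as pattern (a rough class-balance check)."""
    return sum(labels) / len(labels) if labels else 0.0
```

For example, `intervals_to_frame_labels([(0.5, 0.7)], total_sec=2.0)` marks 20 of 200 frames as pattern, so a model that always predicts "no pattern" would already be right on 90% of the frames.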

Does anyone have ideas on what type of model could be useful, or on whether I could label the data differently so that the models learn what they should be looking for?