BERT for audio classification

I have been thinking, at a high level, about using BERT for something like audio classification. Suppose I have a time-series dataset of sampled sounds and their labels, for example a short audio clip of a dog barking labeled “dog_bark”. Is it possible to use the BERT architecture to perform this classification?

Naively, I would say BERT would have to be pre-trained from scratch, since the input is time-series data represented as floats rather than discrete text tokens. That also makes me think the token embeddings would have to be reworked, because there is no fixed vocabulary to look up. I don’t have any concrete ideas yet, but that is where I was starting. Curious whether others have had similar ideas or thoughts on the matter.
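
To make the embedding question a bit more concrete, here is a rough sketch of one possible direction, not a worked-out method: cut the waveform into short overlapping frames and map each frame to a hidden-size vector with a linear projection, so each frame plays the role of a "token". Everything here is an assumption for illustration: the helper names (`frame_signal`, `make_projection`, `embed_frames`), the frame length, hop, and hidden size, the random untrained weights, and the synthetic sine wave standing in for a real clip.

```python
import math
import random


def frame_signal(samples, frame_len, hop):
    """Split a 1-D list of floats into overlapping fixed-length frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


def make_projection(frame_len, hidden_size, seed=0):
    """Random (untrained) weight matrix of shape hidden_size x frame_len.

    In a real model these weights would be learned during pre-training.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(frame_len)
    return [[rng.gauss(0.0, scale) for _ in range(frame_len)]
            for _ in range(hidden_size)]


def embed_frames(frames, weights):
    """Project each frame to a hidden_size vector (a plain matrix product)."""
    return [[sum(w * x for w, x in zip(row, frame)) for row in weights]
            for frame in frames]


# Toy stand-in for an audio clip: 1 second of a 440 Hz sine at 8 kHz.
sample_rate = 8000
clip = [math.sin(2 * math.pi * 440 * t / sample_rate)
        for t in range(sample_rate)]

frames = frame_signal(clip, frame_len=400, hop=160)   # 50 ms frames, 20 ms hop
weights = make_projection(frame_len=400, hidden_size=16)
embeddings = embed_frames(frames, weights)

# One embedding vector per frame, each of length hidden_size.
print(len(embeddings), len(embeddings[0]))
```

The resulting sequence of vectors would take the place of the wordpiece embedding lookup. With Hugging Face's `BertModel`, for example, precomputed embeddings can be passed through the `inputs_embeds` argument instead of `input_ids`, with the hidden size matching the model config. The masked-token pre-training objective would still need rethinking for continuous inputs, which this sketch does not address.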

EDIT: I am aware that there are other models better suited to this, such as wav2vec, that fall into the ASR or audio classification space. In this instance, though, I was specifically curious about adapting BERT to the task.