I have a very good understanding of traditional transformer architectures for NLP, but I have recently been given a task that requires working with raw audio.
I understand that tokenizers for BERT map a word to a specific index, and that index points to a word vector in a dictionary.
But when I feed raw audio data into Wav2Vec2FeatureExtractor, where the raw audio data looks like:
tensor([ 3.9514e-06, 1.0558e-04, -4.7315e-06, ..., 3.9716e-04, 2.4415e-04, 9.1544e-05])
I get back a tensor of float values, which looks like:
tensor([[-0.0020, -0.0001, -0.0021, ..., 0.0051, 0.0024, -0.0004]])
What are these features that Wav2Vec2FeatureExtractor generates? In NLP, each word is mapped to a vector representation, and that vector is the feature representation. So what are the raw audio samples being mapped to?
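
One way to sanity-check the output above: as far as I can tell from the Hugging Face documentation, with do_normalize=True the feature extractor does not look anything up in a table. It only rescales each waveform to zero mean and unit variance (and pads batches), so the result is still one float per audio sample. The sketch below mimics that step with NumPy under that assumption. The function name zero_mean_unit_var, the eps value, and the synthetic waveform are illustrative choices, not the library's code or my real data.

```python
import numpy as np


def zero_mean_unit_var(waveform, eps=1e-7):
    """Rescale a 1-D waveform to zero mean and (approximately) unit variance.

    Assumption: this mirrors the normalization step of the feature
    extractor; eps is a hypothetical stabilizer against division by zero.
    """
    waveform = np.asarray(waveform, dtype=np.float32)
    return (waveform - waveform.mean()) / np.sqrt(waveform.var() + eps)


# Synthetic stand-in for one second of 16 kHz audio (not real data).
rng = np.random.default_rng(0)
raw_audio = rng.normal(loc=0.0, scale=0.05, size=16000).astype(np.float32)

input_values = zero_mean_unit_var(raw_audio)

# Same number of values in and out: no lookup, no embedding, just rescaling.
print(raw_audio.shape, input_values.shape)
print(float(input_values.mean()), float(input_values.std()))
```

If that assumption holds, the learned vector representations (the analogue of word embeddings) would come from the Wav2Vec2 model itself, not from the feature extractor.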