@sanchit-gandhi, when we set output_hidden_states=True for the Wav2Vec2 model, we get 13 tensors, of which 12 correspond to the outputs of the encoder layers. What is the very first output tensor? In the BERT model, this corresponds to the output of the embedding layer. In Wav2Vec2, is this the output of the feature extractor projected into some space and combined with positional information?
Hey @RajSang! Great question! That’s exactly right: the first hidden state is the output of the CNN layers, projected to the transformer’s hidden size, with the positional embedding added, i.e. the latent speech representations that we pass into the first transformer layer.
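To see this layout for yourself, here is a minimal sketch that inspects the tuple returned with output_hidden_states=True. It assumes `torch` and `transformers` are installed, and it uses a tiny, randomly initialised `Wav2Vec2Config` rather than a pretrained checkpoint; the small sizes are illustrative choices of mine to keep it fast, not the real base-model dimensions. Only `num_hidden_layers=12` mirrors the model in the question.

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2Model

# Tiny, randomly initialised model. The sizes below are illustrative
# assumptions; only num_hidden_layers=12 matches the base model discussed above.
config = Wav2Vec2Config(
    hidden_size=64,
    num_hidden_layers=12,
    num_attention_heads=4,
    intermediate_size=128,
    conv_dim=(32, 32, 32, 32, 32, 32, 32),
    num_conv_pos_embeddings=16,
    num_conv_pos_embedding_groups=4,
)
model = Wav2Vec2Model(config).eval()  # eval mode: no dropout, no time masking

# One second of fake 16 kHz audio, shape (batch, samples).
waveform = torch.randn(1, 16000)

with torch.no_grad():
    outputs = model(waveform, output_hidden_states=True)

hidden_states = outputs.hidden_states

# One tensor for the input to the first transformer layer,
# plus one per transformer layer.
print(len(hidden_states))  # 13

# hidden_states[0] already has the transformer's shape (batch, frames, hidden_size).
print(hidden_states[0].shape)

# The final entry is the same tensor that is returned as last_hidden_state.
print(torch.equal(hidden_states[-1], outputs.last_hidden_state))
```

With a pretrained checkpoint the same indexing applies: `hidden_states[0]` is what goes into the first transformer layer, and `hidden_states[i]` is the output of layer `i`.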