Regarding outputs in Encoder


Say I have a sentence “This is hugging face”.
And I pass it through an encoder, say BERT.

The encoder takes in the attention mask and input IDs, and the output is (512, 768)-dimensional.

My question is: what do these outputs refer to? Does each (1, 768) encoder output representation correspond to its corresponding input token? And if so, are only the first (3, 768) representations relevant?
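For context, here is a small sketch (not from the original post, assuming `bert-base-uncased` via the `transformers` library) that checks the shapes in question. Without padding, the sequence dimension equals the number of tokens the tokenizer produces, which includes the special `[CLS]` and `[SEP]` tokens and any subword pieces; the (512, 768) shape only appears when padding to `max_length=512`, and the attention mask marks the padded positions:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentence = "This is hugging face"

# Without padding: seq_len = number of tokens, including [CLS]/[SEP]
# and any subword splits, so it can exceed the word count.
inputs = tokenizer(sentence, return_tensors="pt")
seq_len = inputs["input_ids"].shape[1]

with torch.no_grad():
    outputs = model(**inputs)
hidden = outputs.last_hidden_state  # shape (1, seq_len, 768)

# Padding to max_length=512 reproduces the (512, 768) shape from the
# question; the attention mask is 0 at the padded positions.
padded = tokenizer(
    sentence, padding="max_length", max_length=512, return_tensors="pt"
)
with torch.no_grad():
    padded_out = model(**padded)
padded_hidden = padded_out.last_hidden_state  # shape (1, 512, 768)
```

Under these assumptions, each row `hidden[0, i]` is the contextual representation of token `i`, so in the padded case only the first `seq_len` of the 512 rows carry information about the sentence (and that includes the `[CLS]`/`[SEP]` positions, not just the word tokens).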