Hello,
I have been reading the documentation of the BEiT model here. In the section on `pooler_output`, this is what is written:
> Last layer hidden-state of the first token of the sequence (classification token) after further processing through the layers used for the auxiliary pretraining task. E.g. for the BERT-family of models, this returns the classification token after processing through a linear layer and a tanh activation function. The linear layer weights are trained from the next sentence prediction (classification) objective during pretraining
After going through the code [here](https://github.com/huggingface/transformers/blob/master/src/transformers/models/beit/modeling_beit.py#L666-L667), the pooler output is actually the mean of the hidden states, not a linear projection of the CLS token.
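To illustrate the mismatch, here is a minimal sketch (in plain NumPy, not the actual `transformers` code) of the two behaviours: mean pooling over the patch tokens, as the linked BEiT code does, versus the BERT-style linear-plus-tanh pooler on the CLS token that the docs describe. The weight and bias below are random placeholders, not pretrained parameters.

```python
import numpy as np

def mean_pool(hidden_states):
    # BEiT-style: average the patch tokens (everything after the CLS token)
    return hidden_states[:, 1:, :].mean(axis=1)

def cls_pool(hidden_states, weight, bias):
    # BERT-style pooler described in the docs: linear layer + tanh on the CLS token
    return np.tanh(hidden_states[:, 0, :] @ weight + bias)

rng = np.random.default_rng(0)
hidden = rng.standard_normal((2, 197, 768))  # (batch, 1 CLS + 196 patches, hidden)
w = rng.standard_normal((768, 768)) * 0.02   # placeholder pooler weights
b = np.zeros(768)

print(mean_pool(hidden).shape)        # (2, 768)
print(cls_pool(hidden, w, b).shape)   # (2, 768)
```

Both return the same shape, so the difference is easy to miss unless you read the modeling code.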
Would it be possible to update the documentation? The current wording is confusing when reading through it.
Note: I thought of raising the issue in the GitHub repo but couldn't find how to do so in the case of documentation.