I have a conceptual question about BERT training. When BERT is pre-trained, is the so-called pooling layer already used during pre-training?
At the moment it seems to me that the pooling layer is only relevant for sequence classification, but I could be wrong. However, in the code it is set to true by default. What exactly is the purpose of this pooling layer?
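For context, here is roughly what I understand the pooling layer to be doing, as a PyTorch-style sketch (the class name and shapes here are my own illustration, not copied from the library): it takes the hidden state of the first token (`[CLS]`) and passes it through a dense layer with a tanh activation, producing one fixed-size vector per sequence.

```python
import torch
import torch.nn as nn

class Pooler(nn.Module):
    """Sketch of a BERT-style pooler: project the [CLS] hidden state."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.Tanh()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden) -> take the first token only
        first_token = hidden_states[:, 0]
        return self.activation(self.dense(first_token))

# Toy example: batch of 2 sequences, length 8, hidden size 16
hidden = torch.randn(2, 8, 16)
pooled = Pooler(16)(hidden)
print(pooled.shape)  # torch.Size([2, 16])
```

So my understanding is that this collapses the per-token outputs into a single sequence-level vector, which is why it looks classification-specific to me.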