Hi! Does anyone know how the CLS token is initialized in BERT? I mean, let's say I wanted to train a BERT model from scratch (which of course I'm not doing): how should I initialize the CLS embedding? Just at random from some distribution, such as uniform? How is this done in BERT?
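Just like the other tokens, the CLS token's embedding is randomly initialized from a normal distribution with mean 0 and a small standard deviation (`initializer_range`, 0.02 by default; the original TensorFlow code uses a truncated normal). There is nothing special about CLS at initialization; it only becomes useful as a sequence summary through training. The only exception is the padding token: in the Hugging Face implementation its embedding row is set to zero.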
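If it helps, here is a minimal pure-Python sketch of that scheme, not the actual library code. The helper name `init_embedding_table`, the tiny table size, and the seed are made up for illustration; the 0.02 standard deviation and the IDs 0 for [PAD] and 101 for [CLS] are the values used by bert-base-uncased. Real implementations do the same thing on tensors.

```python
import random


def init_embedding_table(vocab_size, hidden_size, pad_id=0, std=0.02, seed=0):
    """Build a vocab_size x hidden_size table drawn from N(0, std^2).

    Hypothetical helper for illustration only: every row, including the
    one for [CLS], is sampled the same way; only the padding row is zeroed.
    """
    rng = random.Random(seed)
    table = [
        [rng.gauss(0.0, std) for _ in range(hidden_size)]
        for _ in range(vocab_size)
    ]
    # The padding token is the one exception: its embedding is all zeros.
    table[pad_id] = [0.0] * hidden_size
    return table


# Token IDs as in the bert-base-uncased vocabulary; the table is kept tiny
# here (the real model uses 30522 x 768).
PAD_ID, CLS_ID = 0, 101
table = init_embedding_table(vocab_size=200, hidden_size=16, pad_id=PAD_ID)

cls_vector = table[CLS_ID]  # small random values, like any other token
pad_vector = table[PAD_ID]  # all zeros
```

So for training from scratch you don't need to treat CLS differently: sample its row like every other row and let training shape it.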