Significance of the [CLS] token

To be honest, this is a bit confusing.

This is how I like to think of the [CLS] token: as a weighted average of the word representations that captures the whole sequence.

That’s the thing: it is not a weighted average at all. It is a special token in its own right, with its own learned embedding, that is trained during pretraining and remains useful in fine-tuning, too.
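
To make the distinction concrete, here is a toy sketch in plain Python. It is not BERT: the vocabulary, the random embedding table, and the single attention layer with identity projections are all made up for illustration, and names such as encode and self_attention are hypothetical. It only shows that [CLS] is a real token with its own embedding row, that it sits at position 0, and that the vector read off at that position is not the same thing as an average of the word vectors, even though attention does mix information from the other tokens into it.

```python
import math
import random

random.seed(0)

# Toy vocabulary (an assumption for illustration); [CLS] is an ordinary entry.
VOCAB = {"[CLS]": 0, "[SEP]": 1, "the": 2, "cat": 3, "sat": 4}
DIM = 4

# One vector per vocabulary entry. In a real model these are learned;
# here they are random stand-ins. [CLS] gets its own row like any other token.
EMBEDDINGS = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in VOCAB]


def encode(words):
    """Prepend [CLS], append [SEP], and map tokens to ids."""
    tokens = ["[CLS]"] + words + ["[SEP]"]
    return [VOCAB[t] for t in tokens]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head attention with identity projections plus a residual
    connection. A deliberate simplification of a transformer layer."""
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights = softmax(scores)
        mixed = [
            sum(w * value[d] for w, value in zip(weights, vectors))
            for d in range(len(query))
        ]
        # Residual: the token's own vector is kept and the mixture is added to it.
        outputs.append([q + m for q, m in zip(query, mixed)])
    return outputs


ids = encode(["the", "cat", "sat"])
hidden = self_attention([EMBEDDINGS[i] for i in ids])

# The sequence representation is read off at position 0, the [CLS] slot.
cls_vector = hidden[0]

# For contrast: a plain average of the word embeddings (no [CLS], no [SEP]).
word_vectors = [EMBEDDINGS[i] for i in ids[1:-1]]
mean_pooled = [sum(column) / len(word_vectors) for column in zip(*word_vectors)]

print("token ids:  ", ids)
print("[CLS] slot: ", cls_vector)
print("mean pooled:", mean_pooled)
```

In this sketch the [CLS] output starts from its own embedding row and keeps it through the residual connection, so it is a position the model can train for a purpose rather than a summary computed from the words after the fact.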