Tbh this is a bit confusing.
This is how I like to think of the [CLS] token: a weighted average of the words such that the representation of the whole sequence is captured.
That’s the thing: it is not at all a weighted average - it is itself a special token that is pretrained and useful in fine-tuning, too.