How does BERT know which contextualised embedding to choose for a word?


I am trying to explain BERT. I understand the concept of a contextualised embedding, where one word has different embeddings depending on the context. I also understand that, by using bidirectionality, BERT can learn these contextualised embeddings during pretraining.

My question is: when fine-tuning BERT for a task and feeding it a sentence, how does BERT know which contextualised embedding to use for a word, given that there are several to choose from?

I am new to the topic of Transformers, but my understanding (limited as it is) of BERT is that whilst the pre-trained embeddings are created through self-supervised learning (just large unannotated text corpora), fine-tuning BERT for a specific task is usually a supervised learning process. So effectively you tell it what the right answers are, and as part of the learning process in fine-tuning it will learn which contextualised embedding (and maybe a lot of other related knowledge) is relevant to what you want it to do.
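It may also help to see that BERT never "chooses" an embedding from a stored list at all: the contextualised embedding is computed on the fly by self-attention over the whole input sentence. Here is a toy sketch of that mechanism in pure NumPy; the vocabulary, embeddings, and weight matrices are all made up for illustration and have nothing to do with real BERT weights, but the point carries over: the same word gets a different output vector when its neighbours change.

```python
import numpy as np

# Toy single-head self-attention (NOT real BERT): contextualised embeddings
# are computed from the whole sentence, not looked up from a stored list.
rng = np.random.default_rng(0)
d = 4  # tiny embedding size for the demo

# Hypothetical static (context-free) embeddings for a few words.
vocab = {w: rng.normal(size=d) for w in ["the", "river", "money", "bank"]}

# Random projections standing in for learned query/key/value weights.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def self_attention(tokens):
    """Each output row mixes information from every token in the sentence."""
    X = np.stack([vocab[t] for t in tokens])          # (n, d) static embeddings
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return weights @ V                                # (n, d) contextual vectors

ctx1 = self_attention(["the", "river", "bank"])[2]    # "bank" near "river"
ctx2 = self_attention(["the", "money", "bank"])[2]    # "bank" near "money"

# Same word "bank", different contexts => different contextualised vectors.
print(np.allclose(ctx1, ctx2))  # False
```

So during fine-tuning there is no selection step: the supervised loss simply adjusts the (pre-trained) weights that produce these context-dependent vectors.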