have been trying to understand how the BERT model works. Specifically, I am trying to understand how it picks up up answers to questions on a given passage. I have tried following thisblog post and whilst It has given me a nice intuition, I would like to better understand what is happening under the hood.
From my understanding, the question and paragraph are tokenised separately and then go through the transformer model. Then, the dot product between the ‘transformed’ tokens and a START/END token is taken, with the higher result giving you that start and Finnish of the answer.
What I would like to understand, what happens to the tokens in this “transformation” (i.e feedforward through the model) that makes it possible to take a dot product and therefore indicate if a word is a START/END.