Hello , I hope you are having a great day.
this question is a bit theoretical and I am struggling to find a full answer.
Could you explain how does projecting embedded tokens to keys ,query ,values and calculating softmax of keys*queries /sqrt(dk) then multiply each value vector by its weight and sum them then stack each weight value and project it again to new vector and forward to a neural network and do the same operation again and again gave the computers the ability of understand ,reasoning and think like a human ?
1 Like