Self-attention mechanism in Transformers

I have a question about the attention mechanism in Transformers (see page 61 of the O’Reilly book “Natural Language Processing with Transformers”). I am trying to compare the meaning and mechanism of what is called self-attention in Transformers with what I previously knew as self-attention from this paper: and as local and global attention from the following: Those papers used a HAN model with self, local, or global attention on top of RNN, GRU, LSTM, or CNN layers. Since Transformers are a newer architecture, I am wondering whether the mathematics behind their attention is the same as in these two papers.
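For concreteness, self-attention in the Transformer setting usually refers to scaled dot-product attention, softmax(Q K^T / sqrt(d_k)) V, where the queries, keys, and values are all projections of the same input sequence. Below is a minimal pure-Python sketch of that computation. The sizes, the random weights, and the helper names (`matmul`, `softmax`, `self_attention`, `random_matrix`) are illustrative assumptions, not code from the book or from the papers mentioned above.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    # Queries, keys, and values all come from the same input x.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = matmul(q, k_t)  # one row of scores per query token
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned parameters or token embeddings."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
seq_len, d_model, d_k = 4, 8, 8  # assumed toy dimensions
x = random_matrix(seq_len, d_model, rng)  # toy token embeddings
w_q = random_matrix(d_model, d_k, rng)
w_k = random_matrix(d_model, d_k, rng)
w_v = random_matrix(d_model, d_k, rng)

output, weights = self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` is a probability distribution over all tokens in the sequence, and each row of `output` is the corresponding weighted average of the value vectors. Having this form written out may make it easier to compare term by term against the attention formulas in the two papers.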
