Self-attention mechanism in Transformers

I have a question about the attention mechanism in Transformers (referring to page 61 of the O’Reilly book “Natural Language Processing with Transformers”). I am trying to compare the meaning and mechanism of what is called self-attention in Transformers with what I previously knew as self-attention from this paper: https://aclanthology.org/N16-1174.pdf, and with the local and global attention from this one: https://arxiv.org/pdf/1508.04025.pdf. What those papers used was the HAN model with self, local, or global attention on top of RNN, GRU, LSTM, or CNN layers. Since the Transformer is a new architecture, I am wondering whether the mathematics behind its attention is the same as in these two papers or not.
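To make the comparison concrete, here is a minimal NumPy sketch of the two formulations as I currently understand them: an additive, HAN-style attention that summarizes a sequence of RNN hidden states into one context vector, versus the scaled dot-product self-attention used in Transformers. The variable names, shapes, and random weights are my own illustrative assumptions, not taken from either paper or the book.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# --- Additive (HAN-style) attention over RNN/GRU/LSTM hidden states ---
# score_t = v^T tanh(W h_t + b); weights = softmax(scores); output = sum_t a_t * h_t
def additive_attention(H, W, b, v):
    # H: (seq_len, hidden_dim) hidden states produced by a recurrent encoder
    u = np.tanh(H @ W + b)             # (seq_len, attn_dim)
    scores = u @ v                     # (seq_len,) one score per time step
    weights = softmax(scores)          # attention weights over time steps
    return weights @ H                 # (hidden_dim,) single context vector

# --- Scaled dot-product self-attention (Transformer-style) ---
# Q, K, V are linear projections of the SAME input sequence X.
def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv   # each (seq_len, d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # (seq_len, seq_len): every token attends to every token
    weights = softmax(scores, axis=-1)
    return weights @ V                 # (seq_len, d_k): one output vector per token

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, hidden, attn, d_model, d_k = 5, 8, 6, 8, 4

    H = rng.normal(size=(seq_len, hidden))
    ctx = additive_attention(H, rng.normal(size=(hidden, attn)),
                             rng.normal(size=attn), rng.normal(size=attn))
    print("additive attention context:", ctx.shape)   # (8,) -> single summary vector

    X = rng.normal(size=(seq_len, d_model))
    out = self_attention(X, rng.normal(size=(d_model, d_k)),
                         rng.normal(size=(d_model, d_k)),
                         rng.normal(size=(d_model, d_k)))
    print("self-attention output:", out.shape)        # (5, 4) -> one vector per token
```

If my sketch is right, the first version scores each hidden state against a learned vector through a tanh layer and collapses the sequence into one context vector, while the Transformer version projects the same sequence into queries, keys, and values and uses a scaled dot product so every token produces its own contextualized output. My question is essentially whether these count as the same underlying mathematics or as genuinely different mechanisms.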
