A potential method to add emotional implicit memory and explicit memory to transformers

I have been experimenting with a method that bolts onto the attention block of GPT-2 and adds a persistent external memory to a specified subset of layers and token-heads. When activated on certain layers with a positive memory, it appears to act as a crude emotional implicit memory. Recall appears to require only a small fraction of a single layer (5 of 2,160 total token-heads in one example) and appears to be independent of context length. More details are available in the write-up I put together on GitHub, but a more capable model than GPT-2 is needed to give a clearer answer as to how well this method actually works.
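To make the general idea concrete, here is a minimal sketch of how persistent memory key/value slots could be concatenated into the attention computation of selected heads. This is my own illustration, not code from the repo: the function name, shapes, and masking scheme are all assumptions, and the actual method in the write-up may differ.

```python
# Illustrative sketch only: letting chosen attention heads also attend to
# persistent external memory slots. Shapes and names are assumptions.
import math
import torch


def attention_with_memory(q, k, v, mem_k, mem_v, active_heads):
    """Scaled dot-product attention where selected heads can read memory.

    q, k, v:       (batch, heads, seq, head_dim) tensors from one layer.
    mem_k, mem_v:  (heads, mem_slots, head_dim) persistent memory stored
                   outside the model and reused across contexts.
    active_heads:  iterable of head indices allowed to read the memory.
    """
    batch, heads, seq, head_dim = q.shape
    mem_slots = mem_k.shape[1]

    # Broadcast the memory over the batch and prepend it to keys/values.
    mem_k_b = mem_k.unsqueeze(0).expand(batch, -1, -1, -1)
    mem_v_b = mem_v.unsqueeze(0).expand(batch, -1, -1, -1)
    k_full = torch.cat([mem_k_b, k], dim=2)   # (batch, heads, mem+seq, head_dim)
    v_full = torch.cat([mem_v_b, v], dim=2)

    scores = q @ k_full.transpose(-1, -2) / math.sqrt(head_dim)

    # Causal mask for the token-to-token part; memory slots stay visible to
    # the active heads and are masked out for every other head.
    causal = torch.triu(
        torch.ones(seq, seq, dtype=torch.bool, device=q.device), diagonal=1
    )
    scores[..., mem_slots:] = scores[..., mem_slots:].masked_fill(
        causal, float("-inf")
    )
    inactive = torch.ones(heads, dtype=torch.bool, device=q.device)
    inactive[list(active_heads)] = False
    scores[:, inactive, :, :mem_slots] = float("-inf")

    return torch.softmax(scores, dim=-1) @ v_full
```

Because the memory tensors in this sketch live outside the model state, the same slots can be reused across prompts of any length, which is one way the context-length independence described above could arise.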

Because this is a bolt-on method, adapting it to more capable models should, in theory, not be a complex task, with possible caveats such as differences in how positional encoding is done. Needless to say, I am quite curious how this method would perform on a more capable model, and if anyone is curious enough to try it, I would love to hear about it.

GitHub page:
https://github.com/MTMTransformer/MTMTransformer