I recently saw this Twitter thread https://twitter.com/_basilM/status/1625185484837208082?s=20 which discusses how adding a layer norm to the Key & Query matrices stabilizes training in transformers / Large Language Models (LLMs) / foundation models. I'd like this stability – especially in pre-trained HuggingFace (HF) models.
Thus, my question is: what is the recommended/best way to add this layer norm to the K/Q (Key/Query) activation matrices in (any) HuggingFace model?
I assume it doesn’t really make a big difference whether the model is pre-trained or not (hopefully). I could copy-paste some attention code and patch the layer norm in, but I wanted to see whether there is a less naive way to do it. For pre-trained models, I assume I can simply load the existing weights, initialize the new layer norm parameters, and then fine-tune a little so the old weights “become aware” of the new layer norm.
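To make my current suggestion concrete, here is a rough sketch of what I mean for a GPT-2-style HF model. Everything here is an assumption on my part: I patch via forward hooks on each block's fused `c_attn` projection, and I normalize Q and K over the full hidden dimension (rather than per head), which is just one possible placement.

```python
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel


def add_qk_layernorm(model):
    """Sketch: attach LayerNorm to the Q and K activations of every block.

    GPT-2's c_attn produces the concatenated [Q, K, V] tensor, so a forward
    hook can split it, normalize Q and K, and re-concatenate.
    """
    n_embd = model.config.n_embd
    for block in model.transformer.h:
        q_ln = nn.LayerNorm(n_embd)
        k_ln = nn.LayerNorm(n_embd)
        # Register as submodules so the new params are trained/saved too.
        block.attn.add_module("q_ln", q_ln)
        block.attn.add_module("k_ln", k_ln)

        def hook(module, inputs, output, q_ln=q_ln, k_ln=k_ln):
            # output: (batch, seq, 3 * n_embd) -> split into Q, K, V
            q, k, v = output.split(n_embd, dim=-1)
            return torch.cat([q_ln(q), k_ln(k), v], dim=-1)

        block.attn.c_attn.register_forward_hook(hook)
    return model


# Tiny randomly-initialized config just to check shapes; for a real model
# you would load pre-trained weights first, then patch and fine-tune.
config = GPT2Config(n_layer=2, n_head=2, n_embd=64, vocab_size=100)
model = add_qk_layernorm(GPT2LMHeadModel(config))
out = model(torch.randint(0, 100, (1, 8)))
```

The hook approach avoids copy-pasting the whole attention implementation, but it silently depends on GPT-2's fused QKV layout, so it wouldn't transfer unchanged to other architectures.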
Thoughts welcomed!
(Other suggestions not using HF are welcome too! But I assume they basically require implementing the transformer from scratch in PyTorch/TF/JAX and adding the layer norm there.)
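For the from-scratch route, here is roughly how I imagine the attention module would look in plain PyTorch, with the layer norm applied per head to Q and K after projection (the exact placement of the norms is my assumption, based on my reading of the thread):

```python
import torch
import torch.nn as nn


class QKNormAttention(nn.Module):
    """Sketch: multi-head self-attention with LayerNorm on Q and K per head."""

    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # The new layers: normalize each head's Q and K over head_dim.
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (B, T, self.n_heads, self.head_dim)
        # Apply the norms before the heads attend, then move heads to dim 1.
        q = self.q_norm(q.view(shape)).transpose(1, 2)
        k = self.k_norm(k.view(shape)).transpose(1, 2)
        v = v.view(shape).transpose(1, 2)
        scores = (q @ k.transpose(-2, -1)) / (self.head_dim ** 0.5)
        attended = scores.softmax(dim=-1) @ v
        return self.out(attended.transpose(1, 2).reshape(B, T, C))


y = QKNormAttention(64, 4)(torch.randn(2, 5, 64))
```

This is basically the standard attention block with two extra `LayerNorm`s, which is why I suspect patching an existing implementation should be feasible rather than rewriting everything.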
refs:
- Twitter cross-post: https://twitter.com/BrandoHablando/status/1627047211807961088?s=20
- Reddit: https://www.reddit.com/r/pytorch/comments/115qkt0/how_to_implement_key_query_layer_normalized/
- Quora: https://www.quora.com/unanswered/How-do-I-implement-key-query-layer-normalized-transformers-LLMs-in-Huggingface