Each of BERT's hidden states is the output of a LayerNorm layer. Since LayerNorm normalizes each token's activations to zero mean and unit variance, I expected the standard deviation of each hidden state (taken over the hidden dimension) to be near 1. However, when I inspect the hidden states, the standard deviations are much smaller. Why would this be the case?
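For reference, here is a sanity check of what I expected, using a plain `torch.nn.LayerNorm` on its own (with its affine weight at the default initialization of 1 and bias 0), independent of BERT:

```python
import torch

# A freshly initialized LayerNorm (weight = 1, bias = 0) should produce
# outputs with per-token std close to 1, regardless of the input scale.
torch.manual_seed(0)
ln = torch.nn.LayerNorm(768)
x = torch.randn(1, 8, 768) * 5 + 3  # arbitrary scale and shift
y = ln(x)
print(y.std(dim=-1))  # every value is close to 1
```

This is the behavior I assumed would carry over to BERT's hidden states.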
# assuming a bert-base checkpoint; the original snippet does not name the model
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True)
for hidden_state in outputs['hidden_states']:
    print(f'Hidden State {hidden_state}')
    print(f'Hidden state std: {hidden_state.std(dim=-1)}')
    print()
Hidden State tensor([[[ 0.1664, -0.0541, -0.0014, …, -0.0811, 0.0794, 0.0155],
[-0.4229, 0.1071, -0.3010, …, 0.0352, -0.3372, 0.2603],
[ 0.5254, 0.1029, -0.0767, …, -0.6114, -0.2440, 0.2591],
…,
[ 0.2794, 0.0381, -0.0276, …, 0.1147, -0.0178, -0.0976],
[ 0.0204, 0.4912, 0.1750, …, 0.4872, -0.2833, -0.0511],
[ 0.1736, -0.1560, 0.0525, …, 0.3813, 0.1285, 0.1339]]],
grad_fn=<NativeLayerNormBackward0>)
Hidden state std: tensor([[0.2599, 0.3309, 0.2946, 0.3154, 0.3333, 0.3045, 0.3281, 0.2870]],
grad_fn=<StdBackward0>)
…
The following Colab notebook reproduces the results: BERT hidden state standard deviation - Colaboratory (google.com).